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The elastic net was introduced as a heuristic algorithm for combinatorial optimisation and has been applied, 
among other problems, to biological modelling. It has an energy function which trades off a fitness term against 
' a tension term. In the original formulation of the algorithm the tension term was implicitly based on a first- 

order derivative. In this paper we generalise the elastic net model to an arbitrary quadratic tension term, 
e.g. derived from a discretised differential operator, and give an efficient learning algorithm. We refer to 
these as generalised elastic nets (GENs). We give a theoretical analysis of the tension term for ID nets with 
periodic boundary conditions, and show that the model is sensitive to the choice of finite difference scheme that 
represents the discretised derivative. We illustrate some of these issues in the context of cortical map models, 
^ , by relating the choice of tension term to a cortical interaction function. In particular, we prove that this 

interaction takes the form of a Mexican hat for the original elastic net, and of progressively more oscillatory 
Mexican hats for higher-order derivatives. The results apply not only to generalised elastic nets but also to 
other methods using discrete differential penalties, and are expected to be useful in other areas, such as data 
analysis, computer graphics and optimisation problems. 



The elastic net was first proposed as a method to obtain good solutions to the travelling salesman problem 
(TSP; Durbin and Willshaw, 1987) and was subsequently also found to be a very successful cortical map model 
(Durbin and Mitchison, 1990; Goodhill and Willshaw, 1990; Erwin et al., 1995; Swindale, 1996; Wolf and Geisel, 
1998). Essentially, it trades off the desire to represent a set of data (cities in the TSP, feature preferences in 
cortical maps) with the desire for this representation to be smooth in the sense of minimising the sum of squared 
distances of neighbouring centroids. The relative success of the elastic net in these applications, its flavour of 
wirelength minimisation, and its ease of implementation, are probable reasons why there has so far been little 
attempt to generalise the model beyond its original formulation, specifically in terms of using more complex tension 
terms. In the TSP context this is perhaps understandable in that the original tension term closely approximates 
the true cost, the sum of nonsquared distances (some investigations of the effect of using alternative exponents 
for the distance function have been performed; Durbin and Mitchison, 1990; Yuille et al., 1996). However, in 
the context of cortical map modelling the tension term represents an abstraction of lateral connections, whose 
functional form is unknown biologically. Although the sum-of-square-distances form was motivated as wirelength 
of neuronal connections, assumed large between neurons having very dissimilar stimulus preferences (Durbin and 
Mitchison, 1990), this begs the question of examining other types of tension terms to see what difference, if any, 
they make, as well as to relate the tension term with biologically significant parameters, such as a description 
of the intracortical connectivity pattern. Dayan (1993) (see also Yuille, 1990; Yuille et al., 1996) was the first 
to suggest this and considered the relation of quadratic tension terms — just like those considered here — with 
other cortical map models. In this paper we investigate generalized tension terms in more detail. Although our 
motivation is primarily biological, the extended model is also expected to be generally useful in areas such as 
computer vision, computer graphics, image processing, unsupervised learning and TSP-type problems. 

The paper consists of three parts. In the first (sections 1-2) we define the generalised elastic net (GEN) via the 
introduction of a quadratic penalty (tension term) in the energy of the net and give efficient learning algorithms 
for it. In the second (sections 3-5) we consider a class of quadratic penalties defined as discretised differential 
operators, and give a theoretical analysis that is applicable not only to generalised elastic nets but also to any 
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model with a Gaussian prior. In the third part (section 6) we demonstrate the results of the previous parts in 
the problem of cortical map modelling and report additional results via simulation. 

Throughout the paper we will often refer to the "original elastic net" meaning the original formulation of the 
elastic net by Durbin and Willshaw (1987), with its sum-of-square-distances tension term. The equations for the 
original elastic net are given in section 2.6. 



1 Probabilistic formulation of the generalised elastic net (GEN) 

Given a collection of centroids {y m }JJ =1 C R D that we express as a D x M matrix Y = f (yi, . . . , vm) and a 
scale, or variance, parameter a G M + , consider a Gaussian-mixture density p(x|Y; a) = J2m=i i?£ > ( x l m ) w ^ n M 
components, where x|m ~ A/"(y m , a 2 I^), i.e., all covariance matrices are equal and isotropic. Consider a Gaussian 

prior on the centroids p(Y; f3) oc e~^ tr ( Y YS ) where (3 is a regularisation, or inverse variance, parameter and S 
is a positive definite or positive semidefinite M x M matrix — in the latter case the prior being improper. The 
normalisation constant of this density will not be relevant for our purposes and is given in appendix A. 

We define a generalised elastic net ( GEN) by p(x.\ Y; a)p(Y; (3). The original elastic net of Durbin and Willshaw 
(1987) and Durbin et al. (1989) is recovered for a particular choice of the matrix S (section 2.6). It is also possible 
to define a prior over the scale a, but we will not do so here, since a will play the role of the temperature in a 
deterministic annealing algorithm. 

Without the prior over centroids, the centroids could be permuted at will with no change in the model, since 
the variable m is just an index. The prior can be used to convey the topologic (dimension and shape) and 
geometric (e.g. curvature) structure of a manifold implicitly defined by the centroids (as if the centroids were 
a sample from a continuous latent variable model; Carreira-Perpihan, 2001). This prior can also be seen as a 
Gaussian process prior (e.g. Bishop et al., 1998a, p. 219), where our matrix S is the inverse of the Gaussian 
process covariance matrix. The semantics of S is not necessary to develop a learning algorithm and so its study 
is postponed to section 3. However, it will be convenient to keep in mind that S will be typically derived from a 
discretised derivative based on a finite difference scheme (or stencil) and that S will be a sparse matrix. 



2 Parameter estimation with annealing 

From a statistical learning point of view, one might wish to find the values of the parameters Y and a that 
maximise the objective function 

logp(Y, <r|X) = logp(X|Y, a) + logp(Y) - logp(X) (1) 

given a training set {x n }^ =1 C R D expressed as a D x TV matrix X = f (xi, . . . , x n ) — that is, maximum-a-posteriori 
(MAP) estimation. For iid data and ignoring a term independent of Y and a, this equation reduces to: 

£map(Y,<t) = £log e-^II^^H 2 -NDloga- |tr (Y T YS) . (2) 

n—1 m=l 

However, as in Durbin and Willshaw (1987) and Durbin et al. (1989), we are more interested in deterministic 
annealing algorithms that minimise the energy function 

N M R 

£(Y,a) d = f -a^log£e-il a ^ir + |tr(Y T YS) (3) 

n=l m=l ~" 

V v ' V v ' 

fitness term tension term 

over Y alone, starting with a large a (for which the tension term dominates) and tracking the minimum to a 
small value of a (for which the fitness term dominates). This is so because (1) one can find good solutions to 
combinatorial optimisation problems such as the TSP (which require a — >■ 0) and to dimension-reduction problems 
such as cortical map modelling (which do not require a — >• 0); and (2) if considered as a dynamical system for a 
continuous latent space, the evolution of the net as a function of <j and the iteration index may model the temporal 
evolution of cortical maps. To attain good generalisation to unseen data, the parameter f3 can be considered a 
hyperparameter and one can look for an optimal value for it given training data, by cross-validation, Bayesian 
inference (e.g. Utsugi, 1997) or some other means. However, in this paper we will be interested in investigating 
the behaviour of the model for a range of f3 values, rather than fixing j3 according to some criterion. 
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We will call the a-term the fitness term, arising from the Gaussian mixture p(X|Y,a), and the /3-term the 
tension term, arising from the prior p(Y). 

Eq. (3) differs from the MAP objective function of eq. (2) in an immaterial change of sign, in the deletion of a 
now constant term TVDloga, and in the multiplication of the fitness term by acr, where a is a positive constant. 
The latter reduces the influence of the fitness term with respect to the tension term as a decreases. Since a can 
be absorbed in /?, we will take a = 1 hereafter. However, it can be useful to have a different a n for each training 
point x n in order to simulate a training set that overrepresents some data points over others (see section 2.5.1). 

We derive three iterative algorithms for minimising E (gradient descent, matrix iteration and Cholesky fac- 
torisation), all based on the gradient of E: 

! = ->W-YG) + /3 Y^±^) (4) 



dY a 

where we define the weight matrix Wat x m = (w nm ) and the invertible diagonal matrix GmxM = diag (g m ) as 



1 II x n -y m ||2 N 
d_e_f e ^ || a H ^ f „ 

^nm — M „ _ v , M 2 9m — / . ^nm • 

M _i ri y m , ^ 

> , * e 2 II II n=l 



The weight w nm is also the responsibility p(ra|x n ) of centroid fj, m for generating point x n , and so g m is the total 
responsibility of centroid /x m (or the average number of training points assigned to it). The matrix XW is then a 
list of average centroids. All three algorithms, particularly the matrix- iteration and Cholesky- factorisation ones, 
benefit considerably from the fact that the matrix S will typically be sparse (with a banded or block-banded 
structure). 



2.1 Gradient descent 

We simply iterate Y^ r+1 ^ = f Y^ r ^ + AY^ r ^ with AY^ r ^ = f —a §^| Y ( T )- However, this only converges if the ratio 
/3/a is small (for large problems, it needs to be very small). For the original elastic net this is easily seen for a 
net with two centroids: the gradient of the tension term at each centroid points towards the other, and if the 
gradient step is larger than the distance between both centroids, the algorithm diverges exponentially (as we have 
confirmed by simulation). This could be alleviated by taking a shorter gradient step (by taking r^AY^ with 
77 < 1), but for large f3/a the step would become very small. 



2.2 Iterative matrix method 

Equating the gradient of eq. (4) to zero we obtain the nonlinear system 

YA = XW with A = G + cr^ ^~~2~~~) ' (5) 

This equation is similar to others obtained under related modelling assumptions (Yuille et al., 1996; Dayan, 1993; 
Bishop et al., 1998b, a; Utsugi, 1997). A is a symmetric positive definite M x M matrix (since S is positive 
(semi) definite), and both A (through G) and W depend on Y. This equation is the basis for a fixed-point 
iteration, where we solve for Y with G and W fixed, then update G and W for the new Y, and repeat till 
convergence. Convergence is guaranteed since we can derive an analogous procedure as an EM algorithm that 
optimises over Y for fixed a as in Yuille et al. (1994) and Utsugi (1997). Essentially, eq. (5) becomes the familiar 
Gaussian-mixture update for the means with (3 = 0. Thus, from the algorithmic point of view, it is the addition 
of a£ ( S+ 2 ST ) that determines the topology. 

For the class of matrices S studied in this paper, A is a large sparse matrix (of the order of 10 4 x 10 4 ) so 
that explicitly computing its inverse is out of the question: besides being a numerically ill-posed operation, it 
is computationally very costly in time and memory, since A -1 is a large, nonsparse matrix. Instead, we can 
solve the system for Y (with A and W fixed) by an iterative matrix method (see e.g. Isaacson and Keller, 1966; 
Smith, 1985). Rearrange the system (5) as Ax = b (calling x the unknowns) with A = D — L — U (diagonal — 
lower triangular — upper triangular) to give an iterative procedure x( r+1 ) = Cx( r ) + c, a few iterations of which 
will usually be enough to compute x (or Y in eq. (4)) approximately, assuming the procedure converges. The 
following procedures are common: 
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Jacobi Decomposing the equation as Dx = (L + U)x + b results in the iterative procedure x^ r+1 ^ = D 1 (L + 
U)x( r ) + D _1 b. Since D is diagonal, D _1 (L + U) can be efficiently computed and remains sparse. 

Gauss-Seidel Decomposing the equation as (D — L)x = Ux + b results in the iterative procedure x( r+1 ) = 
(D - L) - 1 Ux( r ) + (D - L) - 1 b, which can be implemented without explicitly computing (D — L) 1 if instead 
of computing x( r+1 ) from x^ r \ we use the already computed elements x^ +1 \ . . . , a^ r+1 ^ as well as the old 
ones x^ 2 , ... to compute x^ 7 ^ 1 ^; see e.g. eq. (7d). 

Successive overrelaxation (SOR) is also possible and can be faster, but requires setting of the relaxation 
parameter by trial and error. 

For sparse matrices, both Jacobi and Gauss-Seidel are particularly fast and respect the sparsity structure at each 
iteration, without introducing extra nonzero elements. Both methods are quite similar, though for the kind of 
sparse matrix A that we have with the elastic net, Gauss-Seidel should typically be about twice as fast than 
Jacobi and requires keeping just one copy of Y in memory. 

The matrix iterates can be interleaved with the updates for G and W. 

2.3 Direct method by Cholesky factorisation 

We can obtain Y efficiently and robustly by solving the sparse system of equations (5) by Gaussian elimination 
via Cholesky decomposition 1 , since A is symmetric and positive definite. Specifically, the procedure consists of 
(George and Liu, 1981; Duff et al., 1986): 

1. Ordering: find a good permutation P of the matrix A. 

2. Cholesky factorisation: factorise the permuted matrix PAP T into LL T where L is lower triangular with 
nonnegative diagonal elements. 

3. Triangular system solution: solve LYo = PXW and L T Yi = Yo both by Gaussian elimination in O(M) 
and then set Y = P T Yi. 

Step 1 is not strictly necessary but usually accelerates the procedure for sparse matrices. This is because, although 
the Cholesky factorisation does not add zeroes outside the bands of A (and thus preserves its banded structure), 
it may add new zeroes inside (i.e., add new "fill"), and it is possible to reduce the number of zeroes in L by 
reordering the rows of A. However, the exact minimisation of the fill is NP-complete and so one has to use a 
heuristic ordering method, a number of which exist, such as minimum degree ordering (George and Liu, 1981, 
pp. 115-137). 

2.4 Method comparison 

The gradient descent method as defined earlier converges only for /3/a values smaller than a certain threshold 
that is extremely small for large nets; e.g. Durbin and Willshaw (1987) used /3/a = 10 for a 2D TSP problem of 
TV = 100 cities and a net with M = 250 centroids; but Durbin and Mitchison (1990) used /3/a = 0.00025 for a 
4D cortical map model of M = 1600 centroids. For large problems, a tiny ratio /3/a means that — particularly 
for fast annealing — the tension term has very little influence on the final net, and as a result there is no benefit 
in using one matrix S over another. 

A sufficient condition for the convergence of both matrix iteration methods (Jacobi and Gauss-Seidel) is that 
the matrix A be positive definite (which it is). Also, the Cholesky factorisation is stable without pivoting for all 
symmetric positive definite matrices, although pivoting is advisable for positive semidefinite matrices (Golub and 
van Loan, 1996). Thus, all three methods are generally appropriate. However, for high values of /3/a and if S 
is positive semidefinite, then A becomes numerically close to being positive semidefinite (even in later stages of 
training, when G has large diagonal elements). In this case, the Jacobi and Gauss-Seidel methods may diverge, as 
we have observed in practice (and converged if /3/a was lowered). In contrast, we have never found the Cholesky 
factorisation method to diverge, at least for the largest problems we have tried with TV, M « 2 • 10 4 , D = 5, 
stencils of up to order p = 4 and /3/a up to 10 6 . 

The Cholesky factorisation is a direct method, computed in a finite number of operations G(^M 3 ) for dense A 
but much less for sparse A. This is unlike the Jacobi or Gauss-Seidel methods, which are iterative and in principle 

1 David Willshaw (pers. comm.) has also developed independently the idea of the Cholesky factorisation for the original elastic 
net. 
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require an infinite number of iterations to converge (although in practice a few may be enough). With enough 
iterations (a few tens, for the problems we tried), the Jacobi method converges to the solution of the Cholesky 
method; with as few as 5 to 10, it gets a reasonably good one. In general, the computation time required for the 
solution of eq. (5) by the Cholesky factorisation is usually about twice as high as that of the Jacobi method. The 
bottleneck of all elastic net learning algorithms is the computation of the weight matrix W of squared distances 
between cities and centroids, which typically takes several times longer than solving eq. (5). Thus, using the 
Cholesky factorisation only means an increase of around 10-20% of the total computation time. 

Regarding the quality of the iteration, the Cholesky method (and, for enough iterations, the Jacobi and Gauss- 
Seidel methods) goes deeper down the energy surface at each annealing iteration; this is particularly noticeable 
when the net collapses into its centre of mass for large a. In contrast, gradient descent takes tiny steps and requires 
many more iterations for a noticeable change to occur. It might then be possible to use faster annealing than with 
the gradient method, with considerable speedups. On the other hand, the Cholesky method is not appropriate 
for online learning, where training points come one at a time, because the net would change drastically from one 
training point to the next. However, this is not a problem practically since the stream of data can always be split 
into chunks of appropriate size. 

In summary, the robustness and efficiency of the (sparse) Cholesky factorisation make it our method of choice 
and allow us to investigate the behaviour of the model for a larger range of j3 values than has previously been 
possible. All the simulations in this paper use this method. 

2.5 Practical extensions 

Here we describe two practically convenient modifications of the basic elastic net model. 
2.5.1 Weighting points of the training set 

For some applications it may be convenient to define a separate a n > for each data point x n (e.g. to overrepresent 
some data points without having to add extra copies of each point, which would make TV larger). In this case, 
the energy becomes 

N M R 

E(Y, a) = -aJ2 <*n log £ e"* 11*^11' + | tr (Y^YS) 
and all other equations remain the same by defining the weights as 



def ^ / I \ 

ypM -§ n a m \\ 

that is, multiplying the old w nm times a n . 

Over- or underrepresenting training set points is useful in cortical map modelling to simulate deprivation 
conditions (e.g. monocular deprivation, by reducing a n for the points associated with one eye) or nonuniform 
feature distributions (e.g. cardinal orientations are overrrepresented in natural images). It is also possible to make 
a n dependent on a, so that e.g. the overrepresentation may take place at specific times during learning; this is 
useful to model critical periods (Carreira-Perpihan et al., 2003). 

2.5.2 Introducing zero- valued mixing proportions 

We defined the fitness term of the elastic net as a Gaussian mixture with equiprobable components. Instead, 
we can associate a mixing proportion 7r m = f p(m) with each component m, subject to 7r m G [0,1] Vra and 
X)m=i 71 : m = 1- In an unsupervised learning setting, we could take {7r m }^f =1 as parameters and learn them from 
the training set, just as we do with the centroids {y m }m=i- However, we can also use them to disable centroids 
from the model selectively during training (by setting the corresponding 7r m to zero). This is a computationally 
convenient strategy to use non-rectangular grid shapes in 2D nets. Specifically, for each component m to be 
disabled, we need to: 

• Set 7r m = (and renormalise all 7r m ). 

• If using a symmetric matrix S = D T D, set to zero column m of D and all rows m' of D that had a nonzero 
in the element corresponding to column m. This is equivalent to eliminating from the tension term all 
linear combinations that involved y m . This implicitly assumes a specific type of boundary condition; other 
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types of b.c. may be used by appropriately modifying such rows (rather than zeroing them). If S is not 
symmetric, the manipulations are more complicated. 

It is easy to see that the energy is now independent of the value of y m : no "force" is exerted on y m either from 
the fitness or the tension term, and likewise y m exerts no tension on any other centroid. Thus, in the training 
algorithm, we can simply remove it (and the appropriate parts of S, etc.) from the update equations. We could 
also simply leave it there, since its gradient component is zero and its associated equation (5) is zero both in 
the RHS and the LHS (for all d = 1, . . . , D). However, the latter option is computationally wasteful, since the 
operations associated with y m are still being carried out, and can lead to numerical instability in the iterative- 
matrix and Cholesky-factorisation methods, since the matrices involved become singular (although this can be 
easily overcome by setting g m to some nonzero value). 

When mixing proportions 7r m and training set weights a n are used, the energy becomes: 

N M R 

E(Y, a) = -aJ2 « n log ]T n m e~ * " ^ 11° + f tr (Y T YS) 

n—1 m=l 

and all other equations remain the same by defining the weights as 



_ 1 Xn-Ym 



def 

W nrn = CX 7] 



a n p(m|x n ). 



Selectively disabling centroids is useful in cortical map modelling to use a 2D net that approximates the shape 
of primary visual cortex and may include lesions (patches of inactive neurons in the cortex). It can also be used 
to train separate nets on the same training set and so force them to compete with each other (both with ID or 
2D nets); in fact, the central-difference stencil (section 5.5.2) leads to a similar situation by separating the tension 
term into decoupled subterms. 



2.6 Comparison with the original elastic net model 

The original elastic net results from using the matrix D corresponding to a stencil (0, —1, 1) and S = 
(see section 4). For example, for a ID net with M = 9 centroids (where a = for nonperiodic b.c. and a 
periodic b.c): 



D T D 

= 1 for 



D = 



/-l 1 
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a 2 + l 



(6) 



We obtain the original elastic net equations (Durbin and Willshaw, 1987; Durbin et al., 1989) as follows: 



N M R M 

Energy: E(Y,a) = -aa ^ log ^ e'^^^^^ + | ^ 



HYm+l - Yn 



Jacobi: 



Gauss-Seidel: 



m=l 



Gradient: Ay m = a ^ w nm (x n - y m ) + /3a(y m+ i - 2y m + y m _i) 



n=l 



,(t+1) 



,(t+1) 



W nrrl ~X. n 



-y£i) 



n -l u ?im + 2/?<7 



,x„ + My^i + yL-i 1 



(7a) 
(7b) 
(7c) 
(7d) 



The original elastic net papers used annealing with either gradient descent (Durbin and Willshaw, 1987; Durbin 
et al., 1989) or Gauss-Seidel iteration (Durbin and Mitchison, 1990). For the latter, the updates of yi,. . . ,y m 
must be done sequentially on the index m. 
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3 Construction of the matrix S 



The definition of the model and the optimisation algorithms given depend on the matrix S; we are left now with 
the search for a meaningful matrix S that incorporates some knowledge of the problem being modelled. For 
example, for the original elastic net, the tension term embodies an approximation to the length of the net, which 
is the basis for a heuristic solution of the TSP and for a cortical map model. In this section we discuss general 
properties of the matrix S and then, in the next section, concentrate on differential operators. 

Firstly, note that the formulation of the tension term (ignoring the f3/2 factor) as tr (Y T YS) = m / d s rnrn ^ydmydm f 

is not the most general quadratic form of the parameters {ydm}^d=v ^ ms * s because the cross-terms ydraVd'ra' 
for d 7^ d! have weight zero and the terms ydrnVdm' have a weight <s mm / independent of the dimension d. 
Thus, the different dimensions of the net d = (or maps) are independent and formally identical in 

the tension term (though not so in the fitness term). The most general quadratic form could be represented as 
vec (Y) S vec (Y) where vec (Y) is the column concatenation of all elements of Y and S is now MD x MD, or 
as E m m 1 d d' s radm' d'VradVra' d' where s m dm'd' is a 4D tensor. However, our more particular class of penalties still 
has a rich behaviour and we restrict ourselves to it. 

Secondly, note that it is enough to consider symmetric matrices S of the form D T D where D is arbitrary. We 
have that, for a nonsymmetric matrix S: 



tr (y t Y (^Lt?T) > ) = I ( tr (Y T YS) + tr (Y T YS T )) = \ (tr (Y T YS) + tr ((Y T YS T ) T )) = tr (Y T YS) 



and so using S is equivalent to using the symmetric form + 2 . Further, since S must be positive (semi) definite 
for the energy to be lower bounded, we can always write S = D T D for some real D matrix, without loss of 



generality, and so the tension term can be written tr (Y T YS) = ||DY T || where ||A|| = f y^tr (A T A) = y a\- 
is the Frobenius matrix norm. However, the learning algorithms given earlier can be used for any S. 

The D matrix has two functions. As neighbourhood relation, it specifies the strength of the tension between 
centroids and thus the expected metric properties of the net, such as its curvature. As adjacency matrix, it 
specifies what centroids are directly connected 2 and so the topology of the net. Thus, by changing D, we can 
turn a given collection of centroids into a line, a closed line, a plane sheet, a torus, etc. Practically, we typically 
concern ourselves with a fixed topology and are interested in the effects of changing the "elasticity" or "stiffness" 
of the net via the tension term. 

Thirdly, for net topologies where the neighbourhood relations do not depend on the actual region in question 
of the net, we can represent the matrix D in a compact form via a stencil and suitable boundary conditions 
(b.c). Multiplication by D then becomes convolution by the stencil. Further, in section 5.6 we will show that 
with periodic b.c. we can always take D to be a symmetric matrix and so S = D 2 . 

In summary, we can represent quadratic tension terms in full generality through an operator matrix D, and we 
will concentrate on a particular but important class of matrices D, where different dimensions are independent and 
identical in form, and where the operator is assumed translationally invariant so that D results from convolving 
with a stencil. Before considering appropriate types of stencil, we note some important invariance properties. 

3.1 Invariance of the penalty term with respect to rigid motions of the net 

Consider an orthogonal D x D matrix R and an arbitrary D x 1 vector /x. In order that the matrix D represent 
an operator invariant to rotations and translations of the net centroids, we must have the same penalty for Y 
and for R(Y + jzl T ) (where 1 is the Mxl vector of ones): 



Thus, any D will provide invariance to rotations, but invariance to translations requires Dl = 0, i.e., a differential 
operator matrix D (see later). 

Another way to see this, for the circulant matrix case discussed later, is as follows: a circulant positive definite 
matrix S = D T D can always be decomposed as S + /cll T with S circulant positive semidefinite (verifying SI = 0, 
or E m =i si m = 0) and k = -^2-l T Sl = S m =i Sim > 0. This corresponds to the product of two priors, one 

2 However, the notion of adjacency becomes blurred when we consider that a sparse stencil such as (0, —1, 1) can be equivalent 
(i.e., produce the same S) to a nonsparse one, as we show in section 5.6. 





tr (Y T YD T D) = tr ((Y + /zl T ) T R T R(Y + /zl T )D T D) 

= tr (Y T YD T D) + tr (Y T /il T D T D) + tr (l/i T YD T D) + tr (l/Lt T /Ltl T D T D) . 
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given by the differential operator matrix S and the other one by the matrix kll T . The effect of the latter is 
to penalise non-zero-mean nets, since tr (Y T Y(/cll T )) = k^2^ =1 (^m=i 2/dm) 2 > which we do not want in the 
present context, since it depends on the choice of origin for the net. Such matrices can be desirable for smoothing 
applications, naturally, and the training algorithms still work in this case because S remains positive semidefinite 
(and so the energy function is bounded below). 

3.2 Invariance of S with respect to transformations of D or the stencil 

There can be different matrices D that result in the same S = D T D; in fact, D T is a square root of S. Any 
matrix D' = RD will produce the same S if R is a k' x k matrix verifying R T R = This is the same sort of 
unidentiflability as in factor analysis (by rotation of the factors; Bartholomew, 1987), and in our case it means that 
the same nonnegative quadratic form can be obtained for many different matrices D. Particular cases include: 

• Orthogonal rotation of D: R is square with R _1 = R T . 

• Sign reversal of any number of rows of D: R is diagonal with ±1 elements. 

• Permutation of rows of D (i.e., reordering of the summands in the tension term): R is a permutation matrix. 

• Insertion of rows of zeroes to D: R = Hk'xk is the identity I& with k' — k intercalated rows of zeroes. 
When D is derived from a stencil, S is invariant with respect to the following stencil transformations: 

• Sign reversal, e.g. (1, —2, 1) is equivalent to (—1, 2, —1). 

• Shifting the stencil, since this is equivalent to permuting rows of D. In particular, padding with zeroes the 
borders of the stencil (but not inserting zeroes), e.g. the forward difference (0, —1, 1) is equivalent to the 
backward difference (—1, 1, 0) but not to the central difference (—1, 0, 1). 

These invariance properties are useful in the analysis of section 5 and in the implementation of code to construct 
the matrix S given the stencil. 

4 Construction of the matrix D from a differential stencil c 

To represent the matrix D by a stencil consider for example the original elastic net, in which the tension term con- 
sists of summing terms of the form ||y m +i — ym|| 2 over the whole net. In this case, the stencil is (0, —1, 1) and con- 
tains the coefficients that multiply y m _i, y m and y m+ i. In general, a stencil c = fc-fc, S-k+i, • • • , So, • • • , Cfc-i, Sk) 
represents a linear combination J2i=-k SiYm+i of which then we take the square. A given row of matrix D is 
obtained by centring the stencil on the corresponding column, and successive rows by shifting the stencil. If 
the stencil is sparse, i.e., has few nonzero elements, then D and S are sparse (with a banded or block-banded 
structure). 

It is necessary to specify boundary conditions when one of the stencil ends overspills near the net boundaries. 
In this paper we will consider only the simplest types of boundary conditions: periodic, which uses modular 
arithmetic; and nonperiodic or open, which simply discards linear combinations (rows of D) that overspill. In 
both cases the resulting D matrix is a structured matrix: circulant for periodic b.c, since it is obtained by 
successively rotating the stencil (see section 5); or quasi- Toeplitz for nonperiodic b.c, being almost completely 
defined by its top row and left column. The following example illustrates these ideas for a stencil c = (&, b, c, d, e) 
and a ID net with M = 7: 

Nonperiodic b.c: D = a b c d e Periodic b.c: D 

\0 0abcdeJ 

These ideas generalise directly to elastic nets of two or more dimensions, but the structure of D is more compli- 
cated, being circulant or quasi- Toeplitz by blocks. In the analysis of section 5 we consider only periodic b.c. for 
simplicity, but the examples of section 6 will include both periodic and nonperiodic b.c. 

Some notational remarks. As in the earlier example, we will assume the convention that the first row of D 
contains the stencil with its central coefficient in the first column, the coefficients to the right of the stencil occupy 
columns 2, 3, etc. of the matrix and the coefficients to the left occupy columns M, M — 1, etc. respectively. The 



' c d e a b N 
b c d e a 
a b c d e 
a b c d e 
a b c d e 
e a b c 

^deOOabc; 
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remaining rows of D are obtained by successively rotating the stencil. Without loss of generality, the stencil 
must have an odd number of elements to avoid ambiguity in the assignation of these elements to the net points. 
Occasionally we will index the stencil starting from the left, e.g. q = £i> £2? £3? £4)- We will always assume that 
the number of centroids M in the net (the dimension of the vector y) is larger than the number of coefficients 
in the stencil and so will often need to pad the stencil with zeroes. Rather than explicitly notating all these 
circumstances, which would be cumbersome, we will make clear in each context which version of the indexing 
we are using. This convention will make easy various operations we will need later, such as rotating the stencil 
to build a circulant matrix D, convolving the stencil with the net or computing the Fourier transform of the 
stencil. The tension term separates additively into D terms, one for each row of Y, so we consider the case 
D = 1; call y = f Y T the resulting Mxl vector. In this formulation, the vector D T y is the discrete convolution 
of the net and the stencil, while the vector Dy (which is what we use) is the discrete convolution of the net 
and the reversed stencil V == (. • • £2>Ci> Co? C-i,C-2 • • • )• ^ n both cases the tension value is the same. This can 
be readily seen by noting that ^ki^kj — J2fc=i dikdjk if D is circulant, which implies D T D = DD T and 

so ||DY T || 2 = ||D T Y T || 2 ; or by noting that the stencil and the reverse stencil have the same power spectrum 

(N 2 = |V fc | 2 ). 



4.1 Types of discretised operator 

Turning now to the choice of stencil, there are two basic types: 

Differential, or roughening This results when the stencil is a finite-difference approximation to a continuous 
differential operator, such as the forward difference approximation to the first derivative (Conte and de Boor, 
1980): 

A»)« y{u + h) h - y{u) ^<; = io, -1, i) 

where the grid constant h is small. Differential operators characterise the metric properties of the function 
y, such as its curvature, and are local, their value being given at the point in question. Consequently, the 
algebraic sum of the elements of a finite-difference stencil must be zero (otherwise, the operator's value 
value would diverge in the continuum limit, as h — » 0). This zero-sum condition can also be expressed as 
Dl = 0, where 1 is a vector of ones, and has the consequences that: D is rank-defective (having a zero 
eigenvalue with eigenvector 1); S is positive semidefinite and the prior on the centroids is improper; and 
the tension term is invariant to rigid motions of the centroids (translations and rotations). Note the fitness 
term is invariant to permutations of Y but not to rigid motions. 

Integral, or smoothing This results when the stencil is a Newton-Cotes quadrature formula, such as Simpson's 
rule (Conte and de Boor, 1980): 

y{v) dv M ( *« + %(» + M + ^> ) h ^ i = | (0 , „, 4 , „. 

Integral operators are not local, their value being given by an interval. Thus, the corresponding stencil is 
not zero-sum. 

Integral operators depend on the choice of origin (i.e., the "DC value") and so do not seem useful to constrain the 
geometric form of an elastic net, nor would they have an obvious biological interpretation (although, of course, 
they are useful as smoothing operators). Differential operators, in contrast, can be readily interpreted in terms 
of continuity, smoothness and other geometric properties of curves — which are one of the generally accepted 
principles that govern cortical maps, together with uniformity of coverage of the stimuli space (Swindale, 1991, 
1996; Swindale et al., 2000; Carreira-Perpihan and Goodhill, 2002). Note that the effect of the tension term is 
opposite to that of the operator, e.g. a differential operator will result in a penalty for nonsmooth nets. 

Also, let us briefly consider a prior that is particularly simple and easy to implement: S = I, or a tension 
term proportional to X^md^md' resulting from a stencil q = (1). By placing it on the parameters of a mapping, 
this prior has often been used to regularise mappings, as in neural nets (weight decay, or ridge regression; Bishop, 
1995) or GTM (Bishop et al., 1998b). In these cases, such a prior performs well because even if the weights are 
forced to small magnitudes, the class of functions they represent is still large and so the data can be modeled well. 
In contrast, in the elastic net the prior is not on the weights of a parameterized mapping but on the values of 
the mapping itself, and so the effect is disastrous: first, it biases the centroids towards the origin irrespectively of 
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the location of the training set in the coordinate system; second, there is no topology because the prior factorizes 
over all y md . 

We now concentrate on differential stencils, i.e., <r n / n will approximate a given derivative of a continuous 
function f(u) through a regularly spaced sample of /. In essence, the stencil is just a finite difference scheme. 
Particular cases of such differential operators have been applied in related work. Dayan (1993) proposed to 
construct topology matrices by selecting different ways of reconstructing a net node from its neighbours (via a 
matrix £). This strategy is equivalent to using a D matrix defined as D = I — £. For the examples he used, 
£ = (0, 0, 1) corresponds to D = (—1, 1, 0) (the original elastic net: forward difference of order 1); and 
£ = (|, 0, |) corresponds to D = (— |, 1, — |) (forward difference of order 2). The latter was also used by 
Utsugi (1997) and Pascual-Marqui et al. (2001). Here we seek a more systematic way of deriving stencils that 
approximate a derivative of order p. Tables 1-2 give several finite-difference schemes for functions (nets) of one 
and two dimensions, respectively. 



4.2 Truncation error of a differential stencil 

Methods for obtaining finite-difference schemes such as truncated Taylor expansions (the method of undetermined 
coefficients), interpolating polynomials or symbolic techniques can be found in numerical analysis textbooks (see 
e.g. Conte and de Boor, 1980, section 7.1; Gerald and Wheatley, 1994, chapter 4 and appendix B; Godunov and 
Ryaben'kii, 1987, § 11.6). Here we consider Taylor expansions. Given a differential stencil we can determine 
what derivative it approximates and compute its truncation error by using a Taylor expansion around u: 

f(u + mh) = Y J f {r Xu)^f- + (0^^ meZ,£e(u,u + mh). 



r=0 



Consider a centre-aligned stencil ^ = (. . . , <^o, Ci, • • • ) and define ^ (u) = J2m Smd(u — mh) over R where 
Vm = S-m- Then: 

/oo _^ roo 

^2 ^rnd(v ~ mh)f(u -v)dv = S-m / S(v - mh)f(u - v) dv = ^ S- m f(u - mh) 
-oo m m J-oo m 

1 (8) 
= ^ m f(u + mh) = J2^rh r f ir \u)+a R+1 h R+1 f^ R+1 \0 with a r = ^5> m m r . 

m r=0 ' m 

Call p and q the smallest integers such that a p , a q ^ and a r = for r < p and p < r < q. Then, doing q = R+ 1 
gives ^_ 

&\u) = U *£ H + e(h) with e(h) d = f (9) 

where ^ is "near" u. That is, c represents a derivative of order p with a truncation error e(h) of order q — p > 1 
(note that for any differential stencil p > 1 and so ao = ^2 m ^m = 0, the zero-sum condition). Conversely, to 
construct a stencil that approximates a derivative of order p with truncation error of order q — p we simply solve 
for the coefficients s m given the values a r . 

Obviously, a derivative of order p can be approximated by many different stencils (with different or the same 
truncation error). From a numerical analysis point of view, one seeks stencils that have a high error order (so that 
the approximation is more accurate) and as few nonzero coefficients as possible (so that the convolution V * / 
can be efficiently computed); for example, for the first-order derivative the central-difference stencil (— |, 0, |), 
which has quadratic error, is preferable to the forward-difference stencil (0, —1, 1), which has linear error — in 
fact, the central difference is routinely used by mathematical software to approximate gradients. 

However, from the point of view of the GEN, the nonuniqueness of the stencil raises an important question: 
can we expect different stencils of the same derivative order to behave similarly? Surprisingly, the answer is no 
(see section 5.5). 

Before proceeding, it is important to make a clarification. It is well known that estimating derivatives from 
noisy data (e.g. to locate edges in an image) is an ill-posed problem. The derivatives we compute are not ill-posed 
because the net (given by the location of the centroids) is not a noisy function — the value of the tension term is 
computed exactly. 
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Order p 


Stencil x hP 


Error term 


Key 


1 


(0, -1, 1) 


-y"m 


1A 


1 


(-1, i, o) 




1A 


1 


|(0, 0, -3, 4, -1) 




1C 


1 


o, 1) 


-!/'"(£)£ 


IB 


1 


^(0, 0, 0, 0, -25, 48, -36, 16, -3) 




ID 


1 


^(0, 0, -3, -10, 18, -6, 1) 


-» v (0^ 


IE 


1 


^(1, -8, 0, 8, -1) 




IF 


1 


^(0, 0, 2, -24, -35, 80, -30, 8, -1) 




1G 


1 


^(-1, 9, -45, 0, 45, -9, 1) 




1H 


2 


(0, 0, 1, -2, 1) 




2A 


2 


(0, 0, 0, 2, -5, 4, -1) 




2C 


2 


0, -2, 0, 1) 


-y iv (£)^ 


2B 


2 


(1, - 2 , 1) 


-y iv (C)^ 


2A 


2 


^(0, 0, 0, 0, 35, -104, 114, -56, 11) 


-y v (£)^ 


2D 


2 


^(0, 0, 11, -20, 6, 4, -1) 




2E 


2 


^(-1, 16, -30, 16, -1) 


y vi (C)^ 


2F 


2 


-^(0, 0, -13, 228, -420, 200, 15, -12, 2) 


-y vli (0i 


2G 


2 


^(2, -27, 270, -490, 270, -27, 2) 


-y viii (0i^ 


2H 


3 


(0, 0, 0, -1, 3, -3, 1) 


-y iv (Of 


3A 


3 


0, 3, 0, -3, 0, 1) 


-y v (C)^ 


3B 


3 


2, 0, -2, 1) 


-y v (£)^ 


3C 


3 


|(0, 0, -3, 10, -12, 6, -1) 


y v (C)^ 


3D 


3 


|(0, 0, -1, -8, 35, -48, 29, -8, 1) 


-y vii (0^ 


3E 


3 


|(1, -8, 13, 0, -13, 8, -1) 




3F 


4 


(0, 0, 0, 0, 1, -4, 6, -4, 1) 




4A 


4 


(1, 0, -4, 0, 6, 0, -4, 0, 1) 


-y vl (0 2J f 


4B 


4 


(1, -4, 6, -4, 1) 




4A 


4 


|(-1, 12, -39, 56, -39, 12, -1) 


y viii (C)i; 


4C 



Table 1: A gallery of ID discretised differential operators obtained from finite difference schemes, y is a ID function 

of a ID independent variable u evaluated at points in a regular grid with interpoint separation h = f ix m +i — u m , 

so that y m = y(u m ), and £ is an unknown point near u m . The key in the last column refers to figure 4. Example: 
for 3 A we have = ym+s-Sym+2+Sym+i-ym _ y w ^ 3h 

5 Analysis of the tension term in ID 

We have a way of constructing stencils (or convolution kernels) that represent an arbitrary differential operator, 
and from a stencil the corresponding D matrix that represents the discrete convolution with the net. The tension 
term value is then the summed value in norm L2 of this convolution. In this section we theoretically analyse the 
character of the different stencils. We begin by comparing the discrete net with the continuum limit and noting 
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Order p 


Operator 


Stencil x h p 


Error term 


Key 


2 


Laplacian V 2 y 


VI/ 


0(h 2 ) 


n 


2 


Laplacian V 2 y 


*(X) 

z V l l / 


0(h 2 ) 


v 2 x 


2 


Laplacian V 2 y 




0(h 6 ) 


vl 


2 


Laplacian V 2 y 


-±r -1 16 -60 16 -1 

12 V A ) 


0{h A ) 


vl 


4 


Biharmonic V 4 ?/ 


( 2 -8 2 \ 

1 -8 20 -8 1 

V 2 I 8 2 J 


0(h 2 ) 


v 4 


4 


Biharmonic V 4 ?/ 


( -1 14 -1 \ 
I -1 20 -77 20 -1 
± -1 14 -77 184 -77 14 -1 
b -1 20 -77 20 -1 

V ^ i\ ^ J 




0{h A ) 





Table 2: A gallery of 2D discretised differential operators obtained from finite difference schemes, y is a ID 
function of a 2D independent variable u evaluated at points in a regular square grid with interpoint separation 
h = Hui+ij - Uij\\ = - Uij\\, so that y^ = y(\iij). The 2D Laplacian operator is V 2 y = |^| + |^| 

and the 2D biharmonic operator is V 4 ?/ == (V 2 ) 2 y = f^jl + ^ du 2 du 2 fiF* ^ n ^ as ^ commn re f ers to 

figure 6. Example: for we have V 2 ^- = ^ +1 ' J+ ^~ 1 ' J+y ;j +1+ ^' i ~ 1 ~ 4 ^ + (9(/i 2 ). Zero-valued coefficients are 
omitted in the stencil representation. 



the extra degree of freedom that the choice of stencil introduces in the discrete case as opposed to the unique 
definition of derivative in the continuous one. The subsequent analysis is based on the Fourier spectrum of the 
stencil (or equivalently the eigenspectrum of the S matrix) for the case of circulant matrices D (i.e., periodic b.c). 
We characterise the behaviour of families of stencils, based on the forward and central difference, respectively, 
and show how the former but not the latter matches the behaviour in the continuous case. In particular, we show 
that the frequency content of the net (the stripe width for cortical maps) moves towards higher frequencies as 
the stencil order increases in the forward-difference family, while for the central-difference family the net has the 
highest frequency for any order. We also show that the different stencils can be rewritten as Mexican-hat kernels 
with progressively more oscillations as the order increases. In particular, this means that the original elastic net, 
motivated by wirelength arguments, is equivalent to an excitatory- inhibitory Mexican hat. For the most part, we 
will consider ID nets for simplicity, although many of the results carry over to the L-dimensional case. 

While the behaviour of the GEN is given by the joint effect of the fitness and tension terms of its energy 
function, a separate analysis of the tension term gives insight into the character of the minima of the energy. 
Besides, it makes explicit the differences between the continuous and the discrete formulations of the net. 

5.1 Continuous case vs discrete case 

Let us consider the tension term of the energy function (3) of the GEN with S = D T D and again ignoring the ^ 

factor, tr (Y T YS) = ||DY T || 2 . Consider a ID continuous net y(^) == (yi (u), . . . ,yjj(u)) T G 1 D depending on a 
continuous variable u that takes values in R. We can express the tension term as a sum of D terms, one for each 
function y^, where each term is of the form: 

/oo 
(Vy(u)) 2 du= \\Vy\\l (10) 
-co 

where ||-|| 2 is the L2-norm in the space of square-integrable functions. V represents a linear differential operator, 
such as the derivative of order p, V v = f j-^. Since the tension term must be kept small, a function y having 
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Figure 1: Left: power spectrum associated with the derivative of order p of a continuous function, P(k) = (27r&) 2p , 
where k > is the continuous frequency. Right: the area under the curve (27rk) 2p \y(k)\ is the penalty of the 
function y (with power spectrum \y(k)\ 2 ). High frequencies are penalised more than low ones. 



much local variation over R will incur a high penalty and will not likely result in a mimimum of the energy. Any 
function yo belonging to the nullspace of the operator V, i.e., satisfying Vyo(u) = for all u G M, incurs zero 
penalty, and so y + yo incurs the same penalty as y. For example, the nullspace of V p consists of the polynomials 
of degree less than p. However, the penalty due to the fitness term must also be taken into account, so that the 
minima of the energy generally are not in the nullspace of V. 

When the fitness term is also quadratic, such as J (y — g) 2 du for fixed g in a regression, regularisation problems 
like this can be approached from the point of view of function approximation in Hilbert spaces. Under suitable 
conditions, there is a unique minimiser given by a spline (Wahba, 1990). However, in our case the fitness term 
is not quadratic but results from Gaussian-mixture density estimation. In general, the variational problem of 
density estimation subject to derivative penalties is not solved analytically (Silverman, 1986, pp. llOff). 

We can still get insight by working in the Fourier domain. Call y the Fourier transform of y (see appendix C). 
By applying Parseval's theorem to the continuous tension term (10) with pth-order derivative V p , we obtain that 
the tension energy is the same in both domains: 




filter power 

since the Fourier transform of is (i27rk) p y(k). This means that V p is acting as a high-pass filter, where the 
cutoff frequency increases monotonically with p; see fig. 1. Therefore, high-frequency functions will incur a high 
penalty and the minima of the energy will likely have low frequencies — again, subject to the effect of the fitness 
term. 

Using the convolution theorem, we obtain that the inverse Fourier transform of (i27rk) p is c = J^f , where S is 
the delta function, and so we can write the tension term as a convolution with this filter, Vy = <^ * y. This makes 
explicit the relation with the discrete case, which is the one we actually implement in the GEN, by discretising 
both £ and y. However, the discrete case is not an exact correlate of the continuous one because the choice of 
discrete differential operator (stencil) introduces an extra degree of freedom. This choice can result in unexpected 
results. In this paper we consider finite-difference stencils q such that the discrete convolution approximates a 
derivative of order p. In section 5.8.2 we mention an alternative definition based on the delta function. 

5.2 The tension term and the eigenspectrum of circulant matrices 

In this section we define circulant matrices and prove a few simple but powerful properties that we will need later 
(see Davis (1979) for general background). For periodic b.c, both our D and S matrices are circulant, while for 
nonperiodic b.c. they will be approximately Toeplitz and the results below are expected to hold for large nets. 
The basic idea is that the eigenvalues of a circulant matrix can be computed from its first row via the inverse 
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discrete Fourier transform and the eigenvectors are discrete plane waves. Thus, the spectrum is given directly by 
the stencil, and so we can characterise the tension term in the Fourier domain. 

Notation: we will generally use the subindex for the first row or column of a matrix. Many of the operations 
involve complex values, so |-|, •*, - T and - H will mean the modulus, conjugate, transpose and Hermitian transpose, 
respectively. Although the results hold generally, when using the symbols D and S we will implicitly assume 
that q is the stencil that produces D and that S = D T D is the tension term matrix. If a stencil is c = (a, b, c) and 
the D matrix is M x M, the left-aligned stencil (padded with zeroes to the right) will be (a, 6, c, 0, 0, . . . , 0)i x m 
and the centre- aligned stencil will be (6, c, 0, 0, . . . , 0, a)i X M- We will call uj m = f exp(i27rm/M) an Mth root 
of unity, for m G {0, . . . , M — 1}. The discrete Fourier transform (DFT) is defined in appendix C. 

The proofs of the propositions are given in appendix D. 



Definition 5.1 (Circulant and Toeplitz matrices). Consider real, square, M x M matrices D = (d nm ) and 
A = (a nm ). Then A is a Toeplitz matrix if a nm = a m _ n , where a_(M-i), . . . , a_i, ao, ai, . . . , clm-i are fixed real 
values; and D is circulant if d nm = d( m _ n ) mo d m> where do, di, . . . , c?m-i are fixed real values. Thus, we can 
represent A by its first column and row and D by its first row (the rest being successive rotations of the first): 



Toeplitz A 



/ a 


ai 


a 2 


a_i 




ai 


a_ 2 


a_i 


a 



CLM-l\ 

&M-2 

&M-3 



Circulant D 



/ d 

dM-i 

dM-2 



dl 


d 2 . 


dM-i\ 


d 




dM-2 


V-l 


do • 


dM-3 



\a_(M-i) a_(M-2) a_(M-3) a o / \ di d 2 ds . . . do / 

We will need the following properties of circulant matrices. 

Proposition 5.1 (Eigenspectrum of a circulant matrix D). TTie eigenvectors {f m }m=o °f a circulant matrix D 
are complex plane waves, i.e., f mn = e* 27r T for n = 0, . . . , M — 1, and are associated with complex eigenvalues 
x m = E^o 1 dnUJ 7 ^, respectively. 

Remark 5.2. Using the identity Yl^o (e z27r ^) k = MS m (proof: sum of a geometric series), it is easy to see that 
{f m }^~Q are orthogonal, thus linearly independent and so form a basis. Therefore, all circulant matrices of order 
M are diagonalisable in this common basis, i.e., F _1 DF = A where F = (fo, . . . , fM-i), A = diag (Ao, . . . , \m-i) 
and F _1 = ^F^ = ^F* and F = F T . Thus, any circulant matrix D can be written as D = -jFAF* in terms 
of its eigenvalues A. This also shows that if A and B are circulant then AB, A T and A -1 (if it exists) are 
circulant too, and AB = BA (circulant matrices commute). We will call F = f (fo, . . . , fM-i) the Fourier matrix. 

Remark 5.3. The DFT of y is F^y. From proposition 5.1 we have A = F T d with A = (Ao, . . . , Am-i) t and 
d = (do, . . . ,dM-i) T and so we can compute the coefficients of the first row of a circulant matrix D given its 
eigenvalues A as d = -^F* A, or equivalently d rn = ^^Jq 1 A n cj^ for m = 0, . . . , M — 1 (since A is M times the 
inverse DFT of d). 

Remark 5.4. Since D is a real matrix, complex eigenvalues come in conjugate pairs, i.e., for each A m associated 
with eigenvector f m we have A^ associated with f^. However, this does not mean there are 2M eigenvalues, 
because the identity uj m = uJ M _ rn implies A m = A M _ m . If M is odd, then we have pairs of distinct (in 

general) conjugate pairs plus Ao = J2 n d n real. If M is even, we have an additional Am = ^2 n d n (— l) n real. 
Thus, there are at most M distinct eigenvalues. 

Remark 5.5. Ao is the sum of any row or column of D and is associated with the constant eigenvector fo = 
(1, . . . , 1) T , while for even M, Am is associated with the sawtooth eigenvector f m = (1, —1, 1, —1, . . . , 1, — 1) T . 

Proposition 5.6 (Eigenspectrum of a circulant matrix S = D T D). If D is real circulant then S = D T D is 
circulant too with real, nonnegative eigenvalues ^ m = |A m | 2 associated with eigenvectors v mn = cos (27r^n) and 
w mn = sin (27rf|n). 

Remark 5.7. The previous proposition allows to obtain the eigenvalue spectrum of S from the coefficients of the 
stencil We can either construct the eigenvalues of D, A m = Yl^o 1 4^, from its first row (do, di, . . . , c?m-i) 
and those of S as v m = |A m | 2 ; or directly construct the eigenvalues of S as v m = Yln=o s n cos (^ 7r ^ n ) with s n 
given in eq. (19). 
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Remark 5.8. We will call v m = |(f m + f^) = 3?(f m ), with v mn = cos (27r^ra), and w m = ^(f m -f^) = 3?(f m ), 
with w mn = sin (27r-^n), the cosine and srae eigenvectors, respectively. Note that, while ||fm|| — ^m^m — 

M. 

the same does not hold for the cosine and sine eigenvectors v m and w m . For the latter, ||v m || 2 = ||w m || 2 = 4^ 
for m > and M for m = 0. 

Remark 5.9. Since S is invariant to permutations of the rows of D and each row of D is a rotation of the first 
row, we are free to take the stencil origin anywhere in that the eigenvalues of S do not change (those of D do, by 
a factor e l27T ^ m for an origin shift of ra/). 

Remark 5.10. S has |_4fj distinct eigenvalues at most, since v m = VM-m for ra = 1, . . . , M — 1: if M is 
odd, then we have v\ , . . . , v m-\ , each with a 2D subspace spanned by the cosine and sine eigenvectors, plus 

2 

vo = Y^n=o s n ~ {Yln=$ d n ) with a ID subspace spanned by the constant eigenvector; and if M is even, 
we have additionally vm = Y^n=v s n(— l) n = ( Y^n=<d d n (— l) n ) 2 with a ID subspace spanned by the sawtooth 
eigenvector. This is the generic situation; for specific values of d n some of the eigenvalues can degenerate. 

Remark 5.11. The relation v m = |A m | 2 implies that D and S have the same null eigenvalues and the same 
nullspace. 

Remark 5.12. The penalty due to the tension term that a net incurs can now be computed by expressing the 
net as a superposition of plane waves, i.e., uniquely decomposing the net Y = y T in the eigenvector basis of S: 

M-l M-l 

y = a ^m for a m Gl with a m = a M -m => ||DY T || 2 = tr (Y T YS) = y T Sy = M ^ a 2 m v m . (11) 

m=0 77! =0 

Thus, frequency ra contributes a penalty proportional to a^ m . The constant eigenvector of ones is the sampled 
version of a constant net. In the continuous case, when applying a differential operator of any order to a constant 
function we obtain the zero function and so no penalty in eq. (10), from eq. (11) with y = f . This holds in the 

discrete case if v$ = Yln=o s n = ( Yln=o = 0> ^ e -' ^ ^he stencil is zero-sum — a condition demanded by the 

truncation error discussion too. However, eigenvectors of the form v mn = n p for fixed p G Z + , corresponding to 
a monomial t p , are not nullified by D in the circulant case, since the rest of eigenvalues are nonzero in general. 
With nonperiodic boundary conditions D is not circulant but quasi- Toeplitz and can nullify polynomials. 

Proposition 5.13 (Fourier power spectrum of stencil c). If D is the M x M circulant matrix associated with 
the stencil the power spectrum of the stencil s is equal to the eigenspectrum of the matrix S = D T D. 

Later, we will find useful the quantity which we call the squared modulus of the stencil. The following 

proposition shows its relation with the power spectrum and the matrix S. 

Proposition 5.14 (Squared modulus of stencil c). J/D is the M x M circulant matrix associated with the stencil 
C and S = D T D ; then the total power of the stencil is P = M ^2 rn ^ = tr (S). 

Remark 5.15. The squared modulus of the stencil appears repeated along the diagonal of S. For Toeplitz 
matrices P = M J2 rn ^ holds but is in general (slightly) different from tr (S). 

Remark 5.16. The tension term penalty for a delta net y = (1, 0, 0,..., 0) equals the stencil squared modulus 
(for any dimensionality D of the net), since the convolution of the stencil with the delta net equals the stencil. If 
a D-dimensional tension term is constructed by passing a ID stencil along each dimension (section 5.7.2), by the 
additivity of the power the penalty is D times the stencil squared modulus. 

5.3 Sawtooth stencils 

The eigenvector f m , or the real cosine and sine eigenvectors, represent waves cos mt and sin mt for t = 27r-^ g 
[0,27r) of discrete frequency ra, which — unlike in the continuous case — is upper bounded by 4^. This highest 
frequency corresponds to a sawtooth, or comb, wave (1,— 1,1,— 1,...,1, — 1) T , which plays a significant role with 
certain stencils (described later). The following proposition gives a condition for a stencil to have zero power at 
the sawtooth frequency, and we will call sawtooth stencil a stencil satisfying this condition. 

Proposition 5.17 (Sawtooth condition in ID). A differential stencil s over a ID net with M centroids (M even) 
has zero power at the sawtooth frequency if and only if the even coefficients sum zero and the odd coefficients sum 
zero. 
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Proposition 5.18 (Sawtooth condition in 2D). A differential stencil s over a 2D net with M x TV centroids (M , 
N even) has zero power at the sawtooth frequency if and only if, imagining a checkerboard pattern, the coefficients 
at the black squares sum zero and the coefficients at the white squares sum zero. 

Remark 5.19. The sawtooth conditions can be analogously expressed in terms of the S matrix generated by 
the stencil. For example in ID, if D is the M x M circulant matrix associated with the stencil ^ and S = D T D, 
the condition is vm = Ylm=o s m(— l) m = 0, where (so, • • • , %-i) is the first row of S. This is a more general 
condition than that on the stencil, since different stencils can result in the same matrix S (see section 5.6). 

Remark 5.20. For nets with an odd number of centroids, the highest-frequency wave is not the sawtooth wave, 
but nearly so, and a stencil satisfying the above condition will have nearly zero power at it (since {^ m }m=i are 
samples of a continuous function of m, if some v m is zero or nearly so at some point, then it must be low in an 
environment of it). Also, in general it is not necessary that the sawtooth wave has nearly zero power for the net 
to develop sawteeth. For this, it is usually enough that the other frequencies have more or equal power. 

For stencils satisfying the sawtooth condition, the highest frequency incurs no penalty (or a negligible one) 
in the tension term, just as the zero-frequency wave (the constant net) does. This does not correspond to the 
continuous case, where for a wave of frequency n we have an average penalty of 



r2n 

which grows monotonically with m and p (for m > 1). 



^ Jo 



dP 

— — sin rnu 

duP 



2 1 
du = -m 2p 



5.4 Fourier space and stencil normalisation 

The representation of a stencil in the net domain is very variable: two stencils that may look completely differ- 
ent and have coefficients of wildly different magnitude may be representing the same derivative (with different 
truncation error). However, in the Fourier domain the character of different stencils is obvious: they are high- 
or band-pass filters. This is because their power spectra happen to have a simple aspect, typically being mono- 
tonically increasing or unimodal curves. Naturally, if the power spectrum was not simple, we would have to 
look at a different representation. From the point of view of the elastic net, then, the truncation error is of 
secondary importance — after all, the "step size" in the net will always remain quite large for 2D nets due to the 
computational cost involved. 

However, in order to compare stencils with each other (either of the same or of different order), and to compare 
nets with different number of centroids or different length 3 applied to the same training set, we need to normalise 
the stencils. This normalisation will result in multiplying the matrix S, or alternatively /?, times a constant. This 
constant will depend on two things: the order and squared modulus of the stencil, and the step size (i.e., the net 
length divided by the number of centroids). The relevant calculations can be done for nets of any dimension and 
are given in appendix B. In this paper we use the normalisation mainly in figures such as 3, which plots a family 
of stencils, and 14, which plots the resulting maps. 



5.5 Stencil families 

We are still left with the question of what stencil to choose, since to represent a derivative of order p one can 
design stencils of arbitrarily small truncation error by making zero as many coefficients a r as desired (except a p ) 
in the Taylor expansion of eq. (8), which in turn will require the stencil to have many nonzero coefficients. We can 
also derive new stencils of order p as linear combinations of existing stencils of the same order p, which we briefly 
consider in section 5.7. Instead of approaching this general question, we will deal here with a specific, but useful, 
way of generating a family of stencils: by iterating a fixed stencil. We will analyse the families associated with 
the first-order forward- difference and central- difference stencils (the backward-difference one being equivalent to 
the forward- difference one by the shift invariance). See fig. 2. 

First of all, we have the following properties. The convolution is an associative and commutative operator (in 
both the continuous and discrete cases), which is reflected in the associativity and commutativity of the respective 
circulant matrices. Repeated application of a first-order differential stencil results in higher-order stencils (recall 
that when writing a convolution of a stencil c with a function we must write the reversed stencil V)' 

/' ~ t * / => f" ~ V * /' = t * (V * /) = (V * t) * /. 

3 The "length" refers to the space of centroid indices m = 1, . . . considered continuous. For example, in ID the length is Mh where 
h is the step size (see appendix B). 
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Figure 2: Stencils s = (. . . , <^o, Ci, • • • ) obtained by repeated application of the forward-difference q = 
(0, —1, 1) and the central-difference ^ c = (— ±, 0, ±) up to order p = 4. The forward-difference family appears 
on the left vertical lines; note that the leading zeroes are irrelevant for the creation of the S matrix and so are 
usually omitted elsewhere in the paper. The backward-difference family (not drawn) results from q> = (— 1, 1, 0) 
and produces the same stencils but with trailing zeroes, and thus results in the same S as the forward- difference 
family (although the truncation error is not exactly the same). The central-difference family appears on the top 
diagonal lines. Hybrid stencils, obtained by applying q and <^ c , appear in the middle. All stencils in this figure 
are sawtooth except for the forward- difference family. Since the convolution is commutative, 
and so are the D matrices, D^D^ = D^D^, the diagram is commutative. 



We call { ip \}%L the family of stencils associated with Wq = 5, Wq = q and &\ = V * V = V * • • • * V * 5 
(p convolutions), where S is the discrete delta function (S m = 1 if m = 0, zero otherwise). Naturally, one can also 
construct hybrid stencils by applying successively different first-order stencils. 

Proposition 5.21 (Composition of stencils). Let q be a stencil of derivative order p and truncation error 0(h p ), 
and g a stencil of derivative order q and truncation error G(h q ). Then V * V = V * V is a stencil of derivative 
order p + q and truncation error 0(h min ( p ,q )). 

Corollary 5.22. If q is a stencil of derivative order 1, then ( p \ is of order p with the same truncation error 
order. 

Proposition 5.23. Let D and E be the matrices associated with the stencils q and g. Then DE is associated 
with V * V- 

Corollary 5.24. A family of stencils {^ P \}^L has associated matrices {D P }^L with D° = I and D the matrix 
associated with q = ^q. The matrix D has eigenvalues X m = ^Z n q n ^m (times an unimportant phase factor), 
the matrix = D p has eigenvalues ^ A m = A^ and the matrix = (D P ) T D P has eigenvalues ^iy m = 
|^A m | 2 = |A m | 2p associated with the cosine and sine eigenvectors. 

Proposition 5.25 (Sawtooth dominance). Let q be a stencil that has zero power at the sawtooth frequency. Then 
V * V = V * V a l so have zero power at the sawtooth frequency for any stencil g. 

This means that if a stencil is the result of the convolution of several stencils, then if any one of these is a 
sawtooth stencil, the convolution will also be a sawtooth stencil. 

5.5.1 Forward- difference family: the continuous-case correlate 

This is defined by the first-order forward-difference stencil q = (0, —1, 1), so D has eigenvalues A m = — 1 + 
e* 27r fr and S has eigenvalues v m = 2 (l — cos (27r^)) = 4 sin 2 (tt^f). Thus the pth-order derivative stencil has 

eigenvalues ^v m = (2 sin (tt^)) 2p e [0,2 2p ] Vra. Note that ]T m q m = but E m ^(-l) m + and so it is not 
a sawtooth stencil (as also indicated by vm =4^0). The case p = 2 corresponds to the small vibrations of a 
string with periodic b.c, the sinewaves fo, . . . , ^m-i being the modes of vibration. 

The following propositions show that this family forms the Pascal triangle when piled up without the p leading 
zeroes (see fig. 2), and that the stencil squared modulus (which appears along the diagonal of ^S) equals (^). 
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Proposition 5.26. For m = — p, . . . ,p; ^ p \ m = (— l) p+m (^J / or ^ > ; zero otherwise. 
Proposition 5.27. £ m = ( 2 /) and tr («S) = M( 2 /) . 

Fig. 3 shows that the forward- difference family forms a progression with p similar to that of the continuous 
case (fig. 1) where the curves slope up more slowly for larger p. Note that even though the nullspace is strictly 
that of the constant waves, since the only null eigenvalue is ^o, as p increases there are more and more near-zero 
eigenvalues for the low frequencies. In other words, the effective nullspace increases with p, consistently with the 
nonperiodic case, where the actual nullspace increases with p. Thus, in this family low frequencies are practically 
not penalised for high p. 

The combination of the power curves of figures 3 and 5 with a qualitative argument based on a drifting cutoff 
allows a partial explanation of the behaviour of the GEN during the deterministic annealing algorithm (though it 
does not allow us to compute the actual preferred frequency as a function of the stencil and /?). The predictions 
are demonstrated by our simulations in fig. 11 for ID nets and fig. 14 for 2D nets. The idea is as follows. The 
fitness term would like to have access to all frequencies so that the net matches the data points, while the tension 
term penalises frequencies proportionally to the power spectrum (remark 5.12). If we assume that an optimal net 
can have at most a certain tension, then it will have at most a certain power pk* (where k* is the largest frequency 
with power less or equal than that power). This power will increase as cr decreases, since the relative influence of 
the tension term decreases. We call the power pk* (cr) the drifting cutoff, since frequencies whose power exceeds 
Pk* (cr) cannot be present in the net at scale cr. 

The drifting cutoff predicts that the nets that arise are expected to show higher and higher frequency as p 
increases. For large cr the tension term dominates, so that even a small tension penalty is not allowed, and the net 
collapses to a point at the centre of mass of the training set. This, which corresponds to the fact that the energy 
has a single minimum, was proven by Durbin et al. (1989) for the original elastic net and is easy to extend to our 
general case. As cr is decreased, the influence of the tension term gradually decreases in favour of the fitness term; 
the energy experiences a series of bifurcations and develops more and more minima which correspond to the net 
stretching more and more in various ways to approach the training set points. Now imagine a cutoff power pk* 
in fig. 3 (the horizontal dashed line) that corresponds to a maximum allowed tension penalty at a given cr. For 
large cr, the cutoff is at pk* = and the only frequency possible for any stencil order is k = 0, i.e., a constant net 
(all centroids at the centre of mass). As a decreases, the cutoff power increases (as in the figure), so that for a 
small p^ the corresponding cutoff frequency fc* is given by the power curves; thus, k* increases with p, i.e., the 
early behaviour of the nets shows higher frequency for high p (narrower stripes in cortical maps). Note that the 
increase of the cutoff proceeds in jumps (corresponding to the bifurcations of the energy) rather than uniformly. 
As cr is further decreased and the cutoff power further increases, higher frequencies appear in the net, but they 
are constrained to develop on the already existing low- frequency net, and so result in local quirks and stretchings 
towards the training set points, superimposed on a lower- frequency structure. 

It follows that if one starts the training at a low value of cr, so that the cutoff corresponds to a high frequency 
fc*, the emerging net loses its smooth structure with little difference between orders p, as we have confirmed in 
simulations. The same occurs if annealing too fast. This can also be understood without recourse to the power 
curves by noting that at low a the fitness term dominates and the centroids are drawn towards the training set 
points with little regard to the tension. 

From fig. 3 we see the cutoff frequencies at low a cluster for high p, i.e., p = 1 and 2 differ much more from 
each other than p = 3 and 4. This is also confirmed by the simulations and has been quantitatively evaluated in 
cortical maps by Carreira-Perpihan and Goodhill (2003). Specifically, the original elastic net (p = 1) often comes 
out as qualitatively different from nets with p > 1. 

The drifting cutoff argument also explains the effect of f3: at any cr or p, increasing j3 strengthens the tension 
term penalty and so lowers the cutoff power p^ , which results in lower cutoff frequencies (wider stripes in cortical 
maps, see fig. 14). 

In 2D nets, the drifting cutoff argument explains not only the behaviour of the preferred frequency of the 
plane waves in the net but also the preferred direction. Note in fig. 5 how for p = 1 the contour lines near 
the k = (0,0) frequency are approximately circular. Thus, for a given power cutoff the largest frequency ||k|| is 
attained approximately equally in any direction, and the resulting stripes in cortical map simulations show no 
preferred direction (fig. 14, p = 1). But for p > 1 (again in fig. 5) the contour lines near k = (0,0) become 
approximately square. Thus, the largest frequency occurs along the line k\ = &2 (or k\ = — £2), an d the resulting 
stripes run preferentially along the diagonals. Other 2D stencils (e.g. Vg in fig. 6) are more isotropic around 
k = and do not result in preferred stripe directions. 
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Figure 3: ID stencils for the forward- and central-difference families, for a net with M = 50 centroids. The left 
panels show the stencils (in the net domain), normalised in maximal amplitude. The right panel shows the power 
spectrum (in the Fourier domain), normalised to integrate to 1 (see appendix B); only its left half, k G [0, 4^], is 
nonredundant. A relatively small M is used to stress the fact that the power is discrete, although underlying it 
is a continuous curve. The horizontal dashed line represents a cutoff power corresponding to a certain value of 
the annealing parameter a (see section 5.5.1). 



5.5.2 Central-difference family: nets with sawtooth waves 

This is defined by the first-order central-difference stencil c = (— |, 0, |), so D has eigenvalues A m = | (— 1 + e* 47r ^) 
and S has eigenvalues u m = | (l — cos (47r^)) = sin 2 (27r^). Thus the pth-order derivative stencil has eigenval- 
ues ^v m — sin 2p (27r^) G [0, 1] Vra. Note that J2m — Xl™ w(— l) m = and so it is a sawtooth stencil (as 
also indicated by vm = 0); thus, all stencils in this family are sawtooth. 

The following proposition shows that the stencil of order p of this family can be obtained from the forward- 
difference stencil of order p by intercalating zero every two components and dividing by 2 P (see fig. 2). 

Proposition 5.28. For m = 0, . . . , 2p: ^ p \ m — l) p+n (^) f or m = 2n even, zero otherwise. 
Proposition 5.29. £ m (p) <£ = ^ ( 2 /) and tr (^S) = M^(J) . 

This family also has a progression with decreasing slopes at low frequencies (see fig. 3), but every one of its 
stencils is a sawtooth stencil. Thus, both the low and high frequencies are practically not penalised. Given that 
the fitness term will favour high frequencies, because this generally allows to match training set points better, the 
elastic nets resulting from this family of stencils very often contain sawtooth patterns (for low enough a). Such 
sawtooth patterns may take all the net or part of it, and can appear superimposed on a lower- frequency wave for 
some values of a. One can also understand why this happens by noting that the tension term decouples into two 
terms, one for the even centroids and the other for the odd centroids (the zero coefficients of the stencil alternate 



19 



with nonzero ones). However, in general it may not be obvious from the structure of the stencil whether it is 
sawtooth, as in ^(— 1, 2, 0, —2, 1) in fig. 7; of course, the sawtooth conditions of section 5.3 can always be used. 
Naturally, 2D stencils obtained by combining a ID central-difference stencil along the horizontal and vertical 

directions are also sawtooth. As a different example of sawtooth stencil in 2D, consider V 2 = \ ( j - 4 j ^ , which 
is a well-known finite-difference approximation to the 2D Laplacian with quadratic truncation error. This stencil 
verifies the sawtooth condition and so develops sawtooth in 2D, which appear as grating- or checkerboard-like 
patterns in the OD and OR maps. One can also understand why this happens in an analogous way to the ID 
case, as follows. The linear combinations of consecutive points in the matrix D do not have any element in 
common; this can be visualised by shifting the stencil either horizontally or vertically and seeing that no nonzero 
coefficients overlap. Consequently, the tension term becomes the sum of two uncoupled terms, one for the "black 
squares" and the other for the "white" ones (imagining again a checkerboard). 

The fact that the central-difference stencil will typically result in a net with sawteeth suggests that it should 
be avoided in applications where continuity of representation is important, such as cortical maps. However note 
that, while the whole net develops sawteeth, the two uncoupled subnets show individually a smooth structure 
(see e.g. fig. 7A,B). This suggests we can use the central-difference stencil to solve a multiple TSP, where the 
training set "cities" must be visited by a given number of salesmen (see fig. 8 and section 7). 

The same type of techniques can be applied to any matrix D, not necessarily a differential operator (although, 
as discussed in section 4.1, non-differential operators are not desirable because they bias the centroids towards 
the origin of coordinates). For example, for ^ = (1), which corresponds to D = S = I, the eigenvalues of S 
are v m = 1 (DFT of a delta). Thus, all frequencies are equally penalized, in agreement with the fact that the 
prior p(Y) factorizes and all centroids are independent. For ^ = |(1, 4, 1) (Simpson's integration rule) we get 
v m = (| (2 + cos (27r^))) 2 , which decreases from a maximum at frequency to a minimum at frequency 4f, the 
opposite to the forward difference. Thus, even though vm > 0, the sawtooth frequency is the least penalized and 
so the net develops sawteeth — consistent with the fact that this is an integral, or smoothing, stencil. 

5.6 The inverse problem and "evenised" stencils 

In section 3.2 we saw that many different matrices D could result in the same S = D T D, thus having the same 
power spectrum and the same tension term value for any net. However, many of these do not correspond to a 
stencil, i.e., their rows are not cyclic permutations of each other, or their elements are complex numbers. In this 
section we (1) show that two nontrivially different stencils (i.e., discounting shifts and sign reversals) can result 
in the same S and (2) show how to obtain a stencil that produces a given positive (semi)definite circulant matrix 
S (the inverse problem). As a corollary of (1) for the forward-difference family we rewrite the sum-of-square- 
distances term of the original elastic net as a Mexican-hat stencil and show that higher-order stencils add more 
oscillations to this Mexican hat. 

We start with the inverse problem, in which we want to obtain a stencil that results in a given power spectrum 
{Pm}m=o- I n g enera ^ there are infinitely many (complex) stencils that would have that power because the discrete 
Fourier transform at every frequency must only agree in modulus and can have any phase (once we fix the DFT, 
the stencil is of course unique). That is, taking the eigenvalues A m = ^Jpm~e %ern for any phase m G [0, 2tt) we 
satisfy |A m | 2 = p m , but then the corresponding stencil is complex for most choices of the phases. However, in 
the particular case where the power spectrum is even, i.e., p m = PM-m for m = 1, . . . , M — 1, there are several 
real solutions, thanks to the fact that both the DFT and the inverse DFT of a real, even collection of numbers 
is also real and even (as can easily be seen). If we take the eigenvalues of D as A m = ^^Jp^ (any sign is 
valid as long as A m = Am-™) then the first row of D (i.e., the stencil) is the DFT of {A m }£f~o divided by M, 
dm = Yln=o A n e~* 27r ^ n , which is real and even. Note that if {A m }^~Q are real but not even then {d m }^~Q 
may be complex. 

The inverse problem can be stated equivalently in terms of circulant matrices as follows: given a circulant, 
symmetric positive (semi)definite matrix S whose eigenvalues are {p m }^~Q, decompose S = D T D with D real 



Figure 4 (following page): A gallery of ID stencils of orders p = 1 to 4, as in fig. 3. The keys 1A, etc. correspond 
to those in table 1. While in the net domain the stencils may look very different, the power curves look similar, 
typically either peaking at pm (like the forward-difference stencils) or dipping at pm (like the central-difference 
ones), the latter being sawtooth stencils. Also note that, while for a fixed p the non-sawtooth power curves differ 
little from each other, there are occasional outliers (e.g. ID). 
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Figure 5: 2D stencils for the forward- and central-difference families (resulting from passing the respective ID 
stencil horizontally and vertically). The graphs show the power spectrum in the space (&i,&2) £ [0,M] 2 , with 
(fci, fe) = (0,0) at the lower left corner, for a square net with M x M = 50 x 50 centroids; white means higher 
power. The nonredundant power spectrum is in its first quadrant [0, 4^] . 
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Figure 7: Typical nets resulting from sawtooth stencils. A: sawtooth arising in a ID nonperiodic net for ocular 
dominance and retinotopy with the third-order stencil |(— 1, 2, 0, —2, 1) (3C in table 1), which results from 
applying twice the forward-difference (0, —1, 1) and once the central-difference ^(— 1, 0, 1). B: sawtooth arising 
in a ID periodic net for a TSP, with the central-difference stencil ^(— 1, 0, 1). C: sawtooth arising in 2D nets for 
ocular dominance with the third-order stencil ~(— 1, 2, 0, —2, 1). This stencil does not penalise the sawtooth 
frequency either horizontally, vertically or diagonally (see fig. 5). D: sawtooth arising in 2D nets for ocular 
dominance with the Laplacian stencil V % . This stencil does not penalise the sawtooth frequency diagonally, but 
it does horizontally and vertically (see fig. 6), so a grid pattern as in C is not possible, but a checkerboard pattern 
is. Compare with the maps resulting from the forward-difference family in fig. 14. 




Figure 8: A possible application of sawtooth stencils. Left: TSP with original elastic net, stencil ^ = (0, —1, 1) 
(forward difference). Right: the same city set but using a sawtooth stencil ^ = ^(— 1, 0, 1) (central difference). 
The tours defined by the even and odd centroids (represented by the dotted lines) can be used to obtain a 
good solution to a multiple TSP. In both cases the b.c. are periodic and the figures correspond to a small but 
nonzero value of the annealing parameter a. Compare with fig. 7B, which shows (for a different city set) the 
central-difference stencil at a larger a (earlier development stage). 
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circulant. To solve the problem, we first prove that the eigenvalues of S are a real, even collection, and then 
obtain D from them. 

Proposition 5.30. Let S be a real circulant matrix with eigenvalues {z/q, . . . , ^m-i} associated with the Fourier 
eigenvectors fo, . . . , Im-i • Then v rn = VM-m, m = 1, . . . , M — 1<^>S is symmetric. 

Remark 5.31. Likewise, if S is real circulant symmetric, then its first row satisfies s m = sm-™> m = 1, . . . , M — l: 

M-l M-l 
k=0 k=0 

Remark 5.32. Any real circulant matrix S = -jFNF* (symmetric or not) can be decomposed as S = A 2 with 

A = ^tFN2F* (where v}/ 2 can have an arbitrary sign), but A need not be real. Proposition 5.30 proves that 
a symmetric real circulant matrix S can be decomposed as S = A T A with symmetric real circulant A = A T = 
-jFNsF* (where the roots y/v^ must be chosen with a sign that preserves y/Tfa — y/VM-m)- Thus, we can 
associate one (or more) real stencils with the first row of S, and then S is a penalty matrix for that stencil, i.e., 

y T Sy = ||Ay|| 2 = ||V*y|| 2 . 

Remark 5.33. One cannot use the spectral decomposition of S as S = UNU T with U real orthogonal (made 
up of the sine and cosine eigenvectors) and N real diagonal because then N2 U T is not circulant in general. 

In summary, we have proven that, given a power spectrum {Pm}m=o verifyingp m = PM-m for m = 1, . . . , M—l 
(or a real symmetric circulant positive (semi) definite penalty matrix S), we can always obtain an even, real stencil 
^ for it (i.e., verifying <^ m = Qw_ m ). This is done by simply taking the eigenvalues of the D matrix as A m = + v /p™ 
and obtaining the stencil as the inverse DFT of them. This also gives a method to evenise any stencil that does 
not verify <^ m = Qw_ m (such as ^ = (0, —1, 1)). We apply it to the forward- and central-difference families in the 
next section. 

Note that the evenisation procedure also implies that we can write any S matrix that results from a stencil as 
S = D 2 (since when using evenised stencils, D is symmetric). Thus, the pth-order matrix is simply = D 2p = 
1 FA 2p F * if D = 1 faf*. 



5.6.1 Forward- difference family 

From section 5.5.1 we have that the eigenvalues of are ^v m — (2 sin (kjj)) 2p , ra = 0, . . . ,M — 1. Since 
sin(TT^) = sin(7r ( M ^ R )) > for m = 1,...,M- 1, we can take = |2 sin (^r^) | ^ = (2 sin (tt^)) p . Thus 
the even, real stencil is 

{p)d - = i £ (p) 4 «b (2,^) = 2 -y. ™ p 4 cos ( 27r S fc ) m = °' • • • - M - 1 (12) 

k=0 ' 1 k=0 K ' 7 



verifying {p) d m = {p) d M -m for m = 1, . . . , M - 1. 

We do not have a formula for this sum in general, but the following proposition characterises the stencils 
asymptotically on the number of centroids M (the limit stencils are good approximations for M > 10 already). 

Proposition 5.34. For the stencil defined by (12), we have 



{P) d 00 d_e_f y (p) , 

M^oo 



p even: 



'(-l) m ( f !j, m<§ 
0, m>f 



p odd: —1 , m = 0, 1, 2, . . . , 00 



^ELlo (4m 2 -(2fe + l) 2 ) 

where we define (2n)H = 2 • 4 • 6 • • • (2n), (2n + 1)!! = 1 • 3 • 5 • • • (2n + 1), 0!! = (-1)!! = 1. 

We also give the exact expression for the case p = 1. It is easy to confirm that (13) tends to the corresponding 
expression in proposition 5.34 for M — >> 00. 
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p = 2n + 1 


n 


sign for m = 


1 2 3 4 5 6 ... 




1 

3 
5 
7 



1 

2 
3 


+ - + + + + + 
+ - + ---- 
+ - + - + + + 


-2/(4m 2 - 1) 

12/(4m 2 - l)(4m 2 -9) 

-240/(4m 2 - l)(4m 2 - 9)(4m 2 - 25) 

etc. 



Table 3: Pattern of sign alternation of the ID "evenised" stencils of the forward-difference family, for odd order 
p. 
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(6, -4, 1, 0, ... 
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(20, -15, 6, -1, 0, ... 
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4 
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- + 




+ 








(70, -56, 28, -8, 1, 0, ... 



Table 4: Pattern of sign alternation of the ID "evenised" stencils of the forward- difference family, for even order 
p. 



Proposition 5.35. For the stencil defined by (12), we have for the case p = 1; 

^d m = 4j Y sinfTT^ Icos^A) = m = 0,...,M-l. (13) 

Consequently, for even p the limit stencil coincides with the original forward-difference stencil (with an imma- 
terial sign change) and is therefore sparse (or compact-support) since only a finite number of elements is nonzero. 
In fact, the stencils must coincide not only in the limit but for finite M, since they are already even, but we do 
not have an expression for Y^=q sin 2p (jr-j^) cos (2Kj^k) to prove it. For odd p the evenised stencils are not 
sparse, but decay as (9(m p+1 ) with a finite number of sign changes, or oscillations, for small m, as given by the 
following proposition. 

Proposition 5.36. For p = 2n + 1 odd, (2n+1) d^ has n + l sign changes in m = 0, 1, 2, ... , n and constant sign 
for m > n, the sign being positive for n odd and negative for n even. That is: 



sgn(^c) 



(-l) m , m<n 
(-l) n+1 , m>n. 



The sign alternation has then the pattern shown in table 3. In particular, ^c?5n (^ ne original elastic net) 
is positive only at m = 0, the central element, and is thus a discrete Mexican hat (or centre-surround kernel), 
with higher-order stencils being oscillatory versions of Mexican hats (i.e., with more rings). Stencils with p = 
1, 5, . . . , 4n + 1 for n = 0, 1, 2, . . . have long-range inhibitory connections while stencils with p = 3, 7, . . . , 4n + 3 
have long-range excitatory connections. Fig. 9 plots the evenised forward-difference stencils for the first few orders 
p (for M = 50); note how for even p the stencils are sparse and coincide with those of fig. 2. 

From proposition 5.34 for even p = 2n we have that 



sg n p">d; 



(— l) m , rn < n 
0, m > n 



just as for odd p. The sign alternation has the pattern shown in table 4. The stencils are, as before, oscillatory 
Mexican hats but with compact support, i.e., strictly short-range connections. The case p = 2 is similar to p = 1 
in having only the central element as excitatory connection. 

The analysis of the stencils in 2D becomes more complicated, with the S matrix being block-circulant, though 
the results should carry over in an analogous way. 
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Forward- difference family Central-difference family 




Figure 9: ID "evenised" stencils of the forward- and central-difference families, normalised in maximal amplitude. 
Compare with the net-domain plots of fig. 3 (the power spectra are the same by definition). For even p they 
coincide with the original ones in fig. 3, except for a change of sign. 



5.6.2 Central-difference family 

From section 5.5.2 we have that the eigenvalues of are = sin 2p (27r^), m = 0,...,M — 1. Thus the 

even, real stencil is 



Linn 



M-l 



-E 

M ^ 



k=0 



sm p 2tt 



M 



cos 



/ m \ 
( 2 *M*) 



m = 0,...,M-l 



(14) 



and the following proposition characterises it asymptotically. 
Proposition 5.37. For the stencil defined by (14), we have 



(p) <?= lim 



f (—1) 2 2 p (p^r^), m even and rn < p 
p even: < 2 2 

I 0, m odd or m > p 



p odd: 



' (_l)4^g_(^)!p!! 

^nS(m 2 -(2fc+i) 2 )' 

lo, 



rn even 



rn odd. 



0,2,4,. ..,oo 



Again, for even p the limit stencil coincides with the original central- difference stencil (with an immaterial 
sign change) and is therefore sparse, while for odd p it is not sparse, decaying as 0(m p+1 ) with a finite number 
of oscillations. The pattern of sign alternation is like that of the forward-difference one but over the even indices 
m (since the odd ones are zero). It is possible to obtain the stencil exactly for p = 1 as in eq. (13) but we shall 
not dwell in precise statements. 

5.7 Linear combinations of stencils 

Another way to create new stencils is to combine linearly stencils of the same or different order. We have two 
options: either to combine in the net domain or in the power domain. We consider both cases for stencils of the 
same order; for stencils of different order, we also need to include the step size h (since it is powered to the order). 



5.7.1 Net domain 

Consider without loss of generality a linear combination (I.e.) stencil c = f ag + bvo where both g and w are of 
order p. From the truncation error expression (9), in order for ^ to be a differential stencil of order p we must 
have a + b = 1 = a p (the a p coefficient for q). If both a and b were nonnegative we could further restrict them 
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to be in [0, 1] and so we would have a convex I.e. s = jg + (1 — 7)07, 7 G [0, 1]; but we accept negative values for 
either a or b. Also note that if a + 6 = then <^ would be of order higher than p. 
Some examples using first-order stencils: 

2(-±, 0, |) - (0, -1, 1) = (-1, 1, 0) (2 x central - 1 x forward = backward) 

2 ((0, -1, 1) - (-|, 0, !)) = (1, -2, 1) (2 x (forward - central) = forward, with p = 2, since a + 6 = 0) 
7(0, —1, 1) + (1 — 7)(— ^, 0, |) (smooth transition between forward and central) 

We see that the I.e. with a sawtooth stencil need no result in a sawtooth stencil. 

By the linearity of the convolution and the Fourier transform, combining in the net domain is the same as 
combining in the Fourier domain and so we have: 



This is very different from combining the power spectra, as we show below. Note that a I.e. in the net domain 
can be designed from scratch, since it is just a particular way of zeroing a r 's coefficients. 

Studying l.c.'s of stencils provides another way to characterise the space of stencils. For example, the dif- 
ferential stencils having at most 3 consecutive non-zero coefficients (a, 6, c) form a 2D vector subspace (since 
a + b + c = 0) spanned by the forward- and central-difference stencils. We will not pursue this approach further 
here. 

Note that using a I.e. of stencils of different order, such as J2p=i a p * s the discrete correlate of penalising 
functions / that do not satisfy the differential equation J2p=i a PllS = ^- 

5.7.2 Power domain 

The I.e. should now be convex in general because the power cannot be negative. Then, for |^| 2 = 7 |^| 2 + (1 — 
7) \zbk\ 2 for each k = 0, . . . , M — 1 we have = 7^,/c + (1 — 7)^7, & and we can express the matrix as 
resulting from two matrices, = ^ yx^D ) ' so = D^D^ = 7$^ + (1 — 7)S W . Basically, we are adding 

more rows to D, which end up being further squared l.c.'s to add to the tension term. 

This is exactly what we do to create a 2D stencil out of a ID one: we first pass the ID stencil horizontally to 
produce a matrix Dh; and then pass it vertically to produce D v . The resulting 2D stencil has a matrix S = D^Dh+ 
D^D V (see fi gure 5 for the power spectrum and figures 14-16 for simulations from cortical maps resulting from 
them). For example, for the first-order forward- difference stencil, we obtain terms of the form ||y m +i, n — YmnW + 
||ym,n+i - Ymn || 2 , whose derivative in the gradient descent algorithm is of the form 4y mn - (y m+ i, n + y m _i, n + 
Ym,n+i +y™,n-i) an d has been used in the past in cortical map models. Note that the term ||y m +i, n — y m n|| 2 + 
||ym,n+i -ymn\\ 2 cannot be expressed as the square of a single I.e. ||Ay mn + By m +i,n + Cy m , n+ i|| 2 with real 
coefficients A, 5, C (though it can with complex coefficients, e.g. A = — V% B = (1 + i)/y/2 = C*). Although 
we could find a (nonsparse) evenised 2D stencil for S, in practice it is simpler to use T>2MxM — ( ^"z^td )• 

For 2D stencils it is also possible to design the stencil directly in the net domain by forcing the relevant 
terms of the 2D Taylor expansion to vanish, e.g. as for the Laplacian stencils V 2 ^ or V 2 . This also allows one 
to introduce other constraints, such as isotropy in some sense — deriving a 2D stencil from horizontal and vertical 
passes of a ID one may result in nets that heavily emphasise those directions. Compare the contour lines around 
k = (0,0) of the power spectrum of (1, —2, 1) in fig. 5 (which align with the coordinate axes) and those of any 
of Vi, Vg or V 2 in fig. 6 (which are roughly circular). However, the horizontal- vertical method results in simple 
b.c. (basically the same as in ID), while stencils such as require a more careful definition of b.c. For example, 
for \j\ one cannot simply zero all rows of D for which the stencil overspills, because the corner points of the net 
would end up in no I.e. (their associated columns in D would be zeroes) and thus be unconstrained. 

In the power domain, the I.e. = 78^ + (1 — 7)8^ of a sawtooth stencil g with a nonsawtooth one w smoothly 
transitions as 7 takes values from to 1. An interesting question we leave for future research is for what 7 the 
I.e. starts showing sawtooth behaviour. 

5.8 Extensions 

We have given a general methodology to characterise the spectral properties of families of stencils with periodic 
b.c. and studied in detail two important families. The space of stencils is, however, much larger. Here we discuss 
additional ways of defining families and nonperiodic b.c. 
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5.8.1 The nonperiodic case 

Nonperiodic b.c. can be defined in different ways, perhaps using ideas from partial differential equations (such as 
Dirichlet or Neumann b.c; Press et al., 1992; Godunov and Ryaben'kri, 1987). They are particularly necessary 
when modelling domains that are not rectangular or have holes, but also when modelling adjacent domains 
with different properties (e.g. to investigate the lateral connections at the border between VI and V2). In the 
simulations of section 6 we use a simple type of nonperiodic b.c. by analogy with the original elastic net: if (a 
nonzero coefficient of) a stencil falls out of the net when applied at a given centroid, we discard it by making 
zero the corresponding row of D. If the stencil size is small compared to the net dimensions, only a few rows are 
zeroed. 

With nonperiodic b.c. the D and S matrices are not circulant anymore, but almost Toeplitz (see e.g. eq. (6)). 
The analysis becomes harder because in general there are no formulas for the eigenvalues (though there are 
asymptotic results for the distribution of the eigenvalues, such as the Szego formula; Grenander and Szego, 
1958). The eigenvectors are not plane waves anymore; in fact, for matrices such as those of eq. (6) (for a = 0) 
the nullspace of D and S becomes larger and can contain discrete polynomials of degree less than p, as in the 
continuous case. However, as M — >> oo, we woould expect our analysis to hold for nonperiodic b.c. 

Many of our results can be considered as an extension of spectral graph theory (Chung, 1997) to higher-order 
derivatives in regular, periodic grids. However, one could think more generally of defining the tension term not 
on a regular grid in dimension D, but on arbitrary graphs. Higher-order derivatives defined on graphs may have 
applications in related areas, such as spectral clustering, of recent interest (e.g. Shi and Malik, 2000). 

Also note that, even for the forward-difference family, the discrete case is not exactly equivalent to the 
continuous one. For example, while the second-derivative annihilates any linear function, in the discrete case the 
centroids must not only be collinear but also equispaced for the net to have zero penalty. 



5.8.2 Stencil design in power space 

So far we have designed stencils in the net space, either from scratch (truncation error equations) or by repeated 
application of known stencils (families). Instead, we can design the stencil in the Fourier domain by specifying 
desirable power spectra (satisfying p = for the stencil to be differential and p m = PM-m fc> r it to be real) and 
then invert it to obtain an even stencil in the net domain. In particular, we can use intermediate power curves 
that would correspond to noninteger derivative orders p (fig. 3). A desirable requirement would probably be that 
the power is a nondecreasing curve, since otherwise some high frequencies would be less penalised than the lower 
ones. One problem with this approach is that inverting a power spectrum to obtain an stencil results in general in 
nonsparse stencils (with many, or typically M, nonzero coefficients). This means that the matrix S is not sparse 
anymore and the computational cost of the learning algorithm becomes very large in both memory and time. One 
possibility would be to find the stencil that most closely approximates a given power spectrum subject to having 
a few nonzero coefficients (for a related problem see Chu and Plemmons, 2003). 

Note that these noninteger-order stencils cannot be obtained as linear combinations of stencils in the net space; 
the power spectra must be averaged instead. In other words, as we saw in section 5.7, the linear superposition 
of stencils in the power space is very different from that in the net space (analogously to the interference of 
incoherent and coherent light, respectively). For example, the average of the forward and backward differences 
in net space is the central difference, while in the power space it is the forward difference. 

This also suggests a way of defining noninteger derivative orders p in the continuous case by taking the inverse 
Fourier transform of (i2nk) p for p G M + . 



5.8.3 Stencils with a scale parameter 

The differential stencils we have used are strictly local in that they do not depend on a scale parameter. For 
example, the forward difference involves only two neighbouring points, irrespectively of the net size M . One way 
(inspired by scale-space theory; Lindeberg, 1994) of introducing a scale parameter in the stencil, and thus define a 
different class of stencils, is as follows. Consider a sequence of continuous functions indexed by a scale parameter 
s G M + that converges to the delta function, such as Gaussians g s (u) = exp (— ||u|| 2 /2s). Now, instead of taking 
the derivative of a function /, we take the derivative of / convolved (smoothed) with g s (or, since the derivative 
and convolution operators commute, smooth the derivative of /). We have: 

d p f d p d p g 
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Thus, we can define a stencil by discretising g s p — ~dup~' ^ e Gaussian case, g s p has a similar shape as the 
forward- difference family and is given by the Hermite polynomial of order p times the Gaussian kernel g s . 

6 Application to cortical map modelling 

In this section we demonstrate by simulation the behaviour of generalised elastic nets, corroborating the previous 
analysis (section 5.5.1 in particular) and also pointing out characteristics that need theoretical explanation. All 
the simulations were obtained for the forward-difference family (orders p = 1 to 4) using the Cholesky factorisation 
method. We consider the problem of cortical maps in primary visual cortex, to which the elastic net was originally 
applied by Durbin and Mitchison (1990) and Goodhill and Willshaw (1990). In the low-dimensional version of this 
problem, the training set of "cities" are values of stimuli to which primary visual cortex cells respond (we consider 
the orientation of an edge OR, its position in the visual field VF and its eye of origin OD) and the centroids 
are the preferred values (receptive fields) of the cells, arranged as a 2D rectangular grid. The stimulus space 
is densely populated with training points, arranged for computational convenience as a grid 4 whose respective 
dimensions determine the development of the net. The typical development sequence with annealing of the a 
parameter from high to low values is the same as for the original elastic net, as follows. At large a the energy E 
has a single minimum at the centre of mass of the training set. At a value <7o the net topographically expands 
along the principal components of the training set, chosen to be the VF axes; cr depends on the training set 
geometry and the S matrix, and can be calculated as in Durbin et al. (1989) as the value of a where the Hessian 
matrix of E stops being positive definite (the energy surface bifurcates), noting that the Hessian of the tension 
term is here |lo ® (S + S T ) (in Kronecker product notation). At a critical value a c < ao the net expands along 
the OD and/or OR axis (whichever has the largest training set variance) and develops stripes. Fig. 10 shows 
this sequence for stencil orders 1 and 2, for a simplified problem where the net is ID and the stimulus space 
consists of the ID VF position and the OD. Note how, unlike the lst-order net, the 2nd-order net avoids sharp 
corners, which have locally high curvature, and goes slightly out of the convex hull of the training set. These two 
characteristics always occur with order p > 1 and are most apparent for intermediate values of <r; when a — » 0, 
the fitness term overwhelms the tension term and the net develops kinks anyway in an effort to interpolate the 
training set (high frequencies are less and less penalised as a decreases). Fig. 11 shows the decrease of the stripe 
width with the stencil order for p = 1 to 4 (with constant /?), this time using periodic b.c. for the net (not for the 
training set). 

In our simulations for p = 1 (at least in dimension D < 3, which can be visualised) the net centroids (and 
therefore the links between them) always remain inside the convex hull of the training set. This can be intuitively 
understood by noting the following. (1) If a centroid lies outside the convex hull, the fitness term value can be 
increased by bringing it towards the convex hull (since the Gaussian is isotropic and monotonically decreasing 
from its centre). (2) The tension term encourages centroids to concentrate. Thus, maxima of E should have no 
centroids outside the convex hull. However, for higher-order tension terms, step (2) is not true anymore, since 
stretched nets can have zero penalty and the centroids can lie outside the convex hull (the tendency increasing 
with the order p). For example, in fig. 10 the rounded ends of the net that exceed the convex hull result in a lower 
tension than flattening the ends and producing two sharp corners with a heavy cost in the second derivative. In 
cortical map models one would rather not have centroids outside the convex hull, since they result in stimulus 
preferences outside their valid range (for VF, OD and OR modulus). If annealing slowly enough and if stopping 
the training shortly after all maps have arisen, the centroids are inside the convex hull except for a few that are 
slightly outside; further, one can define the ranges to exceed the training set range. Another possibility would be 
to constrain the learning algorithm to keep the net inside the convex hull. However, apart from being a difficult 
task, this would probably result in rather different nets. 

We investigate now the effect of f3 and the stencil order p on the ocular dominance and orientation maps with 
a 2D net. As is well known from earlier work on dimension reduction models such as the elastic net and self- 
organising map applied to cortical maps (Wolf and Geisel, 1998; Scherf et al., 1999), the space of j3 and a can be 
decomposed in regions as in a phase diagram. In each region or phase the resulting cortical maps of OD and OR 
have qualitatively different properties. Here we have an additional variable 5 , the stencil order p, and we discuss 
the resulting phase diagram obtained by simulation (assuming the forward- difference family with normalisation 

4 This grid acts as a scaffolding for the stimulus space and is computationally convenient for batch learning. It is also possible to 
use online training and generate a random data point uniformly in the stimulus space (e.g. Wolf and Geisel, 1998). This results in 
very similar maps, though for fixed a the online method takes very long to stabilise. Note that the annealing schedule may need to 
be adjusted if different values of (3, the training set or the stencil are used. 

5 The stencil order p as used here is discrete, but could be considered continuous if interpolating the power spectra as described in 
section 5.8.2. 
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to unit power). For a given p, the OD and OR maps arise at a given critical value a c of a (where a c decreases 
with /?), and do not arise at all if f3 is too large (though for p > 2 we have not obtained unsegregated maps for the 
range of j3 considered). As shown in fig. 13 (right), a c increases with p. Figures 14-16 show the maps on a slice 
of the space a) for a value of a small enough that the OD and OR maps have arisen. Fig. 13 (left) shows a 
schematic phase diagram over (f3,p). Region 1 (for small j3) contains salt-and-pepper maps, without continuity 
among neighbouring centroids; intuitively this reflects the fact that if f3 is very small then the smoothing effect of 
the tension term disappears. Region 2 contains unsegregated maps, where the net stretches (to some extent) along 
the VF variables but not along the OD and OR variables, even as a tends to 0. Region 3 contains columnar maps 
with local geometric characteristics (such as stripe width, pinwheel density, crossing angles or gradient matching) 
that depend on j3 and p. The stripe width increases with f3 and decreases with p, as predicted in section 5.5.1. 
For fixed /?, Carreira-Perpihan and Goodhill (2003) have determined the following effects dependent on p: (1) 
the distribution of crossing angles between the contours of the OD and OR maps is biased towards orthogonality 
for p = 1 but quickly flattens to become uniform as p increases; (2) the OR pinwheels tend to lie away from OD 
borders for p = 1 but gradually occupy OD borders (like beads on a string) as p increases; and (3) the contours of 
OD and OR align preferentially along the diagonals as p increases (due to the anisotropy of the stencil explained 
in section 5.5.1). The maps that have been characterised experimentally so far are best matched in region 3 for 

In the <j variable there is a narrow interval /l just below the critical cr c in which the following phenomenon 
occurs: if training at a fixed a G II, the stripe widths of OD and OR increase and the number of pinwheels 
decrease (where pinwheels of opposite sign approach and annihilate) as a function of the iteration number. 
Pinwheel annihilation has been noted in the literature of cortical map models (Wolf and Geisel, 1998). If training 
for a very long time, it is possible to get rid of nearly all OD borders and OR pinwheels. The reduction of the 
number of OD borders and OR pinwheels are really the same phenomenon, which we call loop elimination (see 
fig. 12), and seems particularly strong for p > 1. It corresponds to the fact that the energy function E presents 
a deep minimum at the net without loops that is only reachable for a G /l; for smaller <j, E develops another 
minimum to which the algorithm is always attracted, corresponding to the striped net. Thus, a fast annealing 
will result in striped maps (which are observed biologically) while a very slow one will result in maps without 
stripes or pinwheels (which have not been observed biologically). The phase diagram for the slow annealing case 
changes as follows. Region 1 (for small j3) contains a constant map, practically identical for all values of p and 
/?, which is severely distorted and contains stripes of different widths. Region 2 contains unsegregated maps as 
before. Region 3 contains columnar maps with loop elimination, i.e., wide stripes and few pinwheels (depending 
on the slowness of the annealing). 

Another intriguing issue is the fact that the global structure of the maps can sometimes be very similar 
for different values of p in part of region 3 of the (/3,p) space. For example, the maps in fig. 14 for (/3,p) G 
{(10 1 , 1), (10 3 ,2), (10 5 ,3)}; and for the diagonal from (10°, 1) to (10 3 ,4). This effect can be partially explained 
by the drifting cutoff argument of section 5.5.1 as follows. For the forward-difference family the power is pu = 
(2 sin (tt^)) 2p and in 2D we normalise by dividing it by twice the stencil squared modulus, 2( 2 ^) (see appendix B), 
so pk ~ \^JWpsm 2p (ff). Assuming the emerging net Y has frequency the tension term value will be 
^ tr (Y T YS) oc TfPk* = \P^/np sin 2p ( 2 ^-) , and any combination of /3 and p for which this tension equals a given 
constant (dependent on /c*) should result in roughly the same map. So we would expect that similar maps occur 
along the family of curves 

log f3 = — 2p log sin ( J — - logp + constant 

where all the logarithms are in base 10. For k* « M (wide stripes) these are straight lines with positive slope 
2 log (M/nk*). This expression indeed matches well the slopes for the cases mentioned above, and also justifies 
the use of a logarithmic scale for f3 (otherwise evident from fig. 14). Note that the slope is at most 2 log (M/-7r) 
and occurs for a single-striped map (k* = 1). The region for log f3 > 2plog (M/tt) should correspond to maps that 
are either single-striped or unsegregated. In the region of thin stripes (for k* close to M/2) the curves become 
horizontal and logarithmic in p. 

However, even though some OD maps may be similar for different p, detailed examination shows that the 
geometric relations between the OD and OR maps still differ as mentioned above (e.g. the pinwheels colocate with 
OD borders as p increases). A rigorous theoretical explanation for the structure of the 3D phase map (/3,p, a) 
both in the continuous and discrete cases is left for future research. 

The GEN could be further extended in two useful ways for cortical map modelling. (1) Instead of using 
deterministic annealing, we could learn both Y and a (e.g. with an EM algorithm) and interpret the resulting 
a as a receptive field size. However, a prior on a that penalises high values will be necessary, because otherwise 
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p=l, (1 V=(0, -1, 1) 



p = 2, (2 k=(l, -2, 1) 



q 




Figure 10: Typical sequence of map development for a ID generalized elastic net in a 2D stimulus space of VF 
(X axis) and OD (Y axis) for the forward-difference family (stencils of order 1 and 2), with nonperiodic b.c. and 
annealing. The thin greyscale bar below the plots represents the OD value (black for one eye, white for the other) 
along the centroids m = 1, . . . , M as a ID ocular dominance map. The net is initially unselective to ocularity, 
after which an initial wave appears, and the final state is a set of stripes. Note the difference in curvature between 
both stencils, e.g. the absence of corners in the second-order stencil (for intermediate a). 



the result is typically a net with large a that stretches along the principal component of the training set but not 
along orthogonal directions with lower variance. In effect, the freedom in the selection of the annealing schedule 
becomes the freedom in the selection of the initial net and the prior on a (biologically interpretable as a constraint 
on the receptive field size). (2) We could consider a stencil <; m or tension strength j3 m that depends on the centroid 
index. For example, the stripe width of OD is smaller at the fovea than at the periphery of the visual field and 
this might be due to a dependence of the lateral connection pattern on the cortical location. Both things can be 
readily incorporated in the model without modifying the learning algorithm, by e.g. defining manually the rows of 
the matrix D, or defining a tension term ^ tr (Y T YS) with S = D T BD, where B = diag . . . , /3m)- However, 
it would be more parsimonious to express explicitly the dependence on m, e.g. by defining f3 m = am + /3q or 
fim = A)6 am for some parameters f3o and a (which could also be learned from the data). 

7 Discussion and related work 

Dynamical system analysis of continuous elastic nets The most important question in the context of 
cortical map modelling is, given a stencil, to characterise the emerging net in terms of stripe width, orthogonality 
of maps, etc. We have given a qualitative explanation of the characteristics of the GEN for different stencil 
orders and values of j3 (for periodic b.c). Although our analysis of the frequency spectrum of the stencils is 
rigorous, the resulting nets depend on the combined effect of the fitness and tension terms. In particular, the 
interrelations that arise between different maps (such as the crossing angles and matching of gradients in 2D nets) 
cannot be explained by the tension term alone since different dimensions are independent in our formulation of 
the tension term. The analysis is simpler if one considers continuous elastic nets, by considering the gradient 
descent optimisation as a dynamical system y = —crVE and then linearising it. Some analyses of this type have 
been done for the continuous first-order ID net in a 2D space of VF and OD with periodic b.c. (self-organising 
map: Obermayer et al., 1992; elastic net: Wolf and Geisel, 1998; Scherf et al., 1999; Wolf et al., 2000; Dayan, 
2001). Here one posits an equilibrium solution (inferred by symmetry considerations) consisting of a topographic 
net unsegregated in ocular dominance, linearises the dynamics at the bifurcation of OD and then studies the 
eigenvalues of the corresponding eigenfunctions of the linearised operator (which are sine waves). It is then 
possible to derive a relation for the value <r c at which the OD bifurcation occurs and the wave frequency k c that 
dominates. 

This analysis is more difficult for the discrete case and depends on the specific stencil used in the tension term. 
From the power spectrum, the forward-difference family is the closest analog of the continuous case, and thus 
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p = 1, = (0, -1, 1) p = 3, < 3 >s = (0, -1, 3, -3, 1) 




Figure 11: Selected subset of cortical map simulations with a generalized elastic net with the forward- difference 
family, for the same problem type as in fig. 10, with periodic b.c. (for the net, but not the training set) and 
j3 = 10 2 . Except for p, all other parameters and the initial conditions are the same. The nets are shown at a 
value of the annealing parameter a shortly after the net expands along the OD axis. Note how the frequency 
increases with p. 



analysis of the continuous case version should correspond to the limit M — »• oo in this case. Linear combinations 
(in stencil space) of forward-difference stencils should yield to this analysis too. However, for stencils of other 
types (e.g. central-difference ones) is it not obvious how to transform the net into a continuous one. Also, some 
effects that do not appear at map formation time but in the (much) longer term, such as the loop elimination, 
may not be explained by the linear approach; and the understanding of the geometric relations between maps 
requires analysing 2D nets. A theoretical understanding of this topic is left for future research. 

Connection with Hebbian models Although originally introduced in the TSP context, most work in the 
elastic net has been carried out in the cortical map literature. Yuille et al. (1996) (see also Yuille, 1990) sought to 
unify two types of cortical map models: the elastic net, and Hebbian models such as that of Miller et al. (1989). 
They did this by defining a generalised deformable model whose energy depends on two types of variables: a set 
of binary variables V that define constraints on the solution (that each neuron maps to a unit in either the left or 
the right eye but not both, and that each eye unit maps to a neuron); and the centroids Y. Eliminating V from 
the energy leads to the elastic net energy, while eliminating Y leads (through several approximations, such as a 
mean field one) to a model closely related to that of Miller et al. (1989). Dayan (1993) went further and suggested 
the use of a more general quadratic form S for the tension term as we do (though he did not derive a training 
algorithm). He then derived an equation similar to (5) and, based in Yuille et al.'s link to the Hebbian model, 
suggested an explicit correspondence between — /3S and the matrix B in Miller et al. (1989), which represents the 
lateral connection pattern. 

However, the link between both models is rather indirect because of the approximations involved with the 
Hebbian model, and also with the elastic net: in order to obtain the latter from the generalised deformable model, 
one needs to assume the binary constraints V hold. But the OD variable in elastic net simulations is continuous, 
the assignment of probability to each centroid is soft (at least for a > 0) and the number of centroids differs in 
general from that of stimuli values. This makes it hard to compare directly the lateral connection matrices in 
both models. Instead, we propose to compare our D matrix with B (see below). 

The effort in comparing elastic and Hebbian matrices was partly motivated by the fact that in appearance the 
elastic net defined a limited lateral connection pattern, where only nearest-neighbouring cells were connected. We 
have shown that this nearest-neighbour connection pattern is exactly equivalent to a certain Mexican-hat lateral 
connection pattern (proposition 5.35). Further, we can draw a formal comparison between the GEN and the 
model of Miller et al. (1989). In the GEN, the D matrix represents the convolution with the neuron preferences 
(represented by the centroids Y); this is then squared and its terms added to obtain the tension term. In the 
model of Miller et al. (1989), the role of the convolution with the cortical neuron preferences is played by the 
matrix B. This suggests we should compare D with B (or their stencils). Also note that in the model of Miller 
et al. (1989) the matrix that is actually taken as a Mexican hat is not the true lateral connection matrix B but 
the rather more abstract (I — B) _1 (which Miller et al. (1989) call "cortical interaction function;" it operates not 
on the neuron preferences but on the net input that a neuron receives, roughly equivalent to our X). However, 
using a Mexican hat in (I — B) _1 results in highly oscillating stencils in B, which — given our results — may be 
one reason why this model does not appear to reproduce robustly either the orthogonality between OD and OR 
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Figure 12: The phenomenon of loop elimination without annealing (for fixed <r), for p = 2 (forward difference). 
r indicates the number of iterations (solutions of the system (5) with the Cholesky factorisation method). Left: 
pinwheel annihilation and stripe widening for a 2D cortical map model (a = 0.05). The training set is a grid in 
the rectangle [0, 1] x [0, 1] x [-0.14,0.14] x [-f , §] x {0.2} with iV = 10 x 10 x 2 x 6 x 1 points and the net has 
M = 72 x 72 centroids and j3 = 100 with nonperiodic b.c. Right: loop elimination for a ID cortical map model 
(<j = 0.016). For this particular case loop elimination happened only for a G [0.016,0.0245]. For other a the net 
converged to a striped configuration. The training set is a grid in the rectangle [0,1] x [—0.0422,0.0422] with 
N = 36 x 2 points and the net has M = 144 centroids and /3 = 5 000 with nonperiodic b.c. In both cases, the net 
that appears initially is striped; then, as the training continues over a long term, loops are eliminated in sudden 
but widely separated events. 
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Figure 13: Left: schematic phase diagram in the space of for a value of a small enough that the OD and 

OR maps have arisen (a = 0.05 in the actual simulations). For fast annealing the regions mean: salt-and-pepper 
maps (1), striped map (2) and unsegregated map (3); compare with fig. 14. For slow annealing, region 1 contains 
a map independent of p and j3 (but not necessarily salt-and-pepper) and region 3 contains maps with wide stripes 
and few pinwheels. The boundaries between regions are drawn approximately. The dotted lines in region 3 
represent curves along which the stripe width is constant, for 4^ > k\ > /c| > /c| > 1. Right: critical value a c for 
which the OD and OR maps start to arise, as a function of j3 for each p. For each p, values of a above the curve 
represent unsegregated maps (region 3). The horizontal line at a = 0.05 marks the point when maps are plotted 
in figures 14-16. The curves for p = 2 to 4 have been raised slightly to distinguish them. 



or the location of pinwheels at the centre of OD columns (Erwin and Miller, 1998). 

Diffusion equations Consider the continuous version of the tension term of eq. (10) for the pth-order derivative, 
with periodic b.c. and in ID for simplicity: 

Et(y) = ^ J {y {p \u)fdu 

on the class of periodic functions y with continuous derivatives up to order 2p. Consider the variational problem 
of finding the net y in that class that minimises the energy E = Ef(y) + E t (y) (Ef(y) being the continuous version 
of the fitness term). The first variation of E t (y) with respect to y can be found by standard variational calculus. 
Let 5y be an arbitrary function of the same type as y. Then: 

5_E^y)_ = Hm E t (y + h5y)-E t (y) = f y ( P ) {u)§y ( P ) {u) du = ( _ 1)P/5 [ y Qp)( u)Sy ( u)du = (-If ^ 
oy h^o h J J 



Figure 14 (following page): Simulations for a 2D generalised elastic net as a model for maps of visual field VF, 
ocular dominance OD and orientation OR. This panel shows the OD maps obtained for a range of values of j3 and 
stencil order p (forward-difference family with normalisation to unit power), at a value a = 0.05 small enough 
that the OD and OR maps have arisen, with all other parameters and the initial conditions being the same. The 
stripes are narrower for low f3 and high p. Simulation details: square net with M x M centroids (M = 128); 
nonperiodic b.c; annealing schedule: from a = 0.2 down to a = 0.05 in a geometric progression with 40 iterations. 
The training set consisted of a regular grid of (10, 10, 2, 12) values in [0, 1] x [0, 1] x [— Z, I] x [— f , f ] for the 2D 
visual field, OD (/ = 0.07) and OR, respectively; the periodic OR variable is implemented as a ring of radius 
r = 0.2 in 2D Cartesian space (see Carreira-Perpihan and Goodhill, 2003 for details). The relative stripe widths 
of the OD and OR maps depend on the correlation parameters of OD / and OR r and has been studied elsewhere 
(Goodhill and Cimponeriu, 2000). 
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p = 1 




Figure 16: As fig. 14 but for contours of ocular dominance and orientation. 
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where we have integrated by parts p times and taken the limit over a test function Sy that is nonzero near one 
point u and zero elsewhere. We now minimise the corresponding Euler-Lagrange equation by gradient descent 
and obtain: 

1 dy = _S_E = ^ /Py _ SE f (y) 
a dt Sy [ } P dv?P Sy 

where t is the continuous iteration time. Disregarding the fitness term, for constant a and p = 1 (the original 
elastic net) this is a diffusion equation for the net y with diffusion constant a (3. That is, a tension term of the 
form ^ J || Vy || 2 results in a diffusion term V 2 y that opposes the attraction towards the cities coming from the 
(nonlinear) fitness term. Note that the elastic net was originally derived from the tea-trade model of topographic 
map formation (von der Malsburg and Willshaw, 1977), based on the diffusion through the brain of hypothesised 
chemical markers. The learning equation (5) can also be seen in this light, with the matrix S acting as a 
discretised derivative of order 2p. However, for order p > 1, the differential equation has order 2p > 2, so it no 
longer corresponds to diffusion. 

Note also the correspondence between the continuous and discrete tension terms: in the continuous case, 
again integrating by parts we obtain J (y^) 2 = (—l) p J yy^ 2p \ or using linear operators in a Hilbert space, 
(V p y,V p y) = (—l) p (y,V2py). Correspondingly, in the discrete case we have ||D p y|| 2 = y T D 2p y by representing 
the first-order matrix D with an evenised stencil (so D is symmetric). 

Differential equations Difference schemes have, of course, been used extensively in the approximate solu- 
tion of ordinary and partial differential equations (Godunov and Ryaben'kii, 1987; Samarskii, 2001), e.g. when 
approximating a boundary value problem by finite differences over a mesh. In this context, the main concerns 
regarding the choice of difference scheme are: accuracy (whether the truncation error is of high order, which will 
mean a faster convergence), stability (whether the approximate solution converges to the true one as the step size 
tends to zero) and sparsity (whether the scheme has few nonzero coefficients). For example, the central difference 
(quadratic error) is preferable to the forward difference (linear error) in terms of accuracy, and equivalent in terms 
of sparsity (both have two nonzero coefficients). Sparsity is desirable to minimise computation time and memory 
storage, but has to be traded off with accuracy, since a high-order truncation error (as well as a high-order deriv- 
ative) will require more nonzero coefficients. As for the stability, it depends not only on the scheme itself but on 
the problem to which it is applied, that is, the combination of type of the differential equation and b.c, difference 
scheme and step size. For example, given a particular differential equation to be solved, the forward and central 
difference may both, none or either one or the other result in a stable method (Godunov and Ryaben'kii, 1987). 
This contrasts with the use of stencils as quadratic penalties, where accuracy and stability do not apply (since 
for any stencil the matrix S is positive (semi) definite and so the algorithm converges), while whether the stencil 
is e.g. sawtooth is crucial. Sparsity remains important computationally, although as we showed some penalties 
admit a dual representation by sparse and nonsparse stencils. 

Maximum penalised likelihood estimation (MPLE) This consists of the following nonparametric density 
estimation problem (Silverman, 1986; Thompson and Tapia, 1990): 

max^log/(x n )-/3i?(/)j 

where {x n }^ =1 is the data sample, / is a density in a suitable function class and R is the smoothness penalty. Note 
that without the penalty the problem is ill-posed because the likelihood becomes infinite as / — >> J2 n =i ^( x — x n)- 
The fact that / must be nonnegative and integrate to 1 so that it is a density makes this problem mathematically 
difficult and most of the existing work concerns the ID case. One approach (Good and Gaskins, 1971) consists of 
defining / = j 2 to remove the nonnegativity constraint. Scott et al. (1980) consider a discretised MPLE problem 
where R is based on the first or second forward differences and take / as piecewise constant or linear. 

In the GEN, the problem is of density estimation too, but we do not directly penalise the density. If we 
consider a Gaussian- mixture density / whose centroids are a function of a lower-dimensional latent variable z: 

/(x)oc|e-H^ir dz 

then the GEN results from discretising the latent variable z and placing the penalty on the discretised centroid 
function /x(z). 
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Other learning algorithms The deterministic annealing algorithm of section 2 has two bottlenecks: one is 
the computation of the weights W (a very large nonsparse matrix); the other, the solution of the system (5). 
We consider the latter here. In this paper we have dealt with tension terms that result in a sparse, structured, 
positive semidefinite matrix S, as a result of the locality of the stencils implementing differential operators and 
the spatial structure of the net. For this particular type of S (or even for a centroid-dependent stencil), the sparse 
Cholesky factorisation for solving system (5) works very robustly and efficiently, and the computation time is 
dominated by the weight computation. However, one may wish to use nonsparse stencils — e.g. if one designs the 
stencil in power space, or uses a stencil derived by discretising a continuous kernel with unbounded support. In 
this case the Cholesky factorisation will be slower, taking 0(M 3 ) operations (where S is M x M), and it becomes 
necessary to develop a more sophisticated algorithm to solve the sytem (5). 

One method that has been used in Bayesian problems and Gaussian processes to invert a covariance matrix is 
conjugate gradients (Skilling, 1993; Gibbs and MacKay, 1997). This constructs a finite sequence with monotoni- 
cally decreasing error that converges to A _1 u (or in general 0(A) u for any function </>). An approximate solution 
is then obtained in ©(several x M 2 ) rather than 0(M 3 ). Unfortunately the number of iterations "several" may 
be difficult to determine in general. Besides, conjugate gradients do not take advantage explicitly of the structure 
of the matrix S, which is in general close 6 to circulant or Toeplitz (by blocks in 2D). Intuitively, such a matrix 
has only O(M) degrees of freedom rather than 0(M 2 ). In fact, see Golub and van Loan (1996, pp. 193ff), it 
is possible to invert Toeplitz matrices of order MxMin 0(M 2 ) using the Trench algorithm and to compute 
products of a circulant matrix times a vector via the fast Fourier transform in O(MlogM) (this also applies to 
Toeplitz- vector products by embedding the M x M Toeplitz matrix in a (2M — 1) x (2M — 1) circulant matrix). 
Extensions of this type of algorithms to near- Toeplitz and other structured matrices are given by Kailath and 
Sayed (1999). 

Topographic maps and dimensionality reduction From a statistical learning point of view, the GEN is 
both a density estimation and a dimensionality reduction method. As a dimensionality reduction method, the 
GEN can be seen as a discretisation of a continuous latent variable model with a uniform density in latent space 
of dimension L, a nonpar ametric mapping from latent to data space (defined by the centroids Y) and an isotropic 
Gaussian noise model in data space of dimension D (Carreira-Perpihan, 2001). A discretised dimensionality 
reduction mapping for a data point x can be defined as either the mean or a mode of the posterior distribution 
p(m\x). Like the dimensionality reduction mapping, the reconstruction mapping (from latent to data space) is 
discretised, implicitly defined by the locations of the centroids. Intermediate values can be defined by linear 
interpolation and we have done so in linking neighbouring centroids in figures 10-12 for visualisation purposes. 

Traditionally the elastic net has not been used for density estimation or dimensionality reduction, but rather in 
applications where generalisation to unseen data was not a concern: the TSP (where the goal is data interpolation 
with minimal length) and cortical maps (where the goal is to replicate biological maps and the training set is 
just a computationally convenient scaffolding for a uniform distribution). In fact, in both cases the intrinsic 
dimensionality of the data (cities and visual stimuli, respectively) is higher than that of the net, which results in 
the net behaving like a space-filling curve or surface (thus the stripes). 

However, the GEN could perfectly be used for data modelling (e.g. as in Utsugi, 1997). It could also be 
used in computer vision and computer graphics, where models (often implemented as a spline) penalised by 
curvature terms have been very successful (e.g. the snakes of Kass et al., 1988 or the functional of Mumford 
and Shah, 1989) — the GEN having the advantage of defining a density. The GEN is most closely related to 
the self-organising map (SOM) and the generative topographic mapping (GTM), with which we provide a brief 
comparison (Bishop et al., 1998b give a comparison of the SOM and GTM). Like the GTM and unlike the SOM, 
the GEN defines a probabilistic model and so an objective function; this gives a principled approach to learning 
algorithms, determination of convergence and inference. GTM has the advantage over both GEN and SOM of 
defining a continuous reconstruction mapping (that passes through the Gaussian centroids). The existence of an 
objective function also allows to define penalties (priors on the mapping or centroids) in a natural way. In the 
GEN, it is very easy to design a derivative penalty of any order in any dimension of the net, either by passing ID 
versions of a forward- difference stencil, or designing a stencil from scratch. In GTM, derivative penalties (e.g. to 
penalise curvature) can in principle be implemented exactly, since the reconstruction mapping is continuous and 
parametric; but this may be practically cumbersome, so that they may be more conveniently approximated by 
discrete stencils as we do with the GEN. For the SOM, such penalties might be defined directly in the learning 
rule (similarly to momentum terms) but this seems more difficult. Alternatively, one could define an objective 

6 Strictly, the matrix that matters is A = G + cr£ ( S+ 2 ) ' This nas the same structure as S but adds a different value to each 
element of the diagonal. Thus, even if S is circulant, A is not circulant or Toeplitz unless G oc I. 
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function that results in a SOM-like algorithm and then apply penalties to it (Luttrell, 1994; Pascual-Marqui et al., 
2001); such an approach is likely to end up in something like the GEN. Computationally, all three models suffer 
from the curse of dimensionality since the latent space is a grid whose number of centroids grows exponentially 
with its dimensionality. 

Two recent methods for manifold learning are Isomap (Tenenbaum et al., 2000) and locally linear embedding 
(LLE; Roweis and Saul, 2000), neither of which defines a density. Isomap attempts to model global manifold 
structure by preserving approximate geodetic distances between all pairs of data points. In contrast, the GEN's 
tension term uses only local information. LLE is closer to the GEN in that it assigns to each manifold point a 
squared zero-sum linear combination of neighbouring points. This might be considered as a local metric estimate, 
similar to the GEN stencil. However, while the latter applies truly to local points in the manifold (by construction) 
in LLE some neighbouring points may in practice not be local in the manifold. 



Combinatorial optimisation and generalised TSPs The original elastic net was proposed at first as a 
continuous optimisation method (by deterministic annealing) to solve the Euclidean travelling salesman problem 
(TSP; Lawler et al., 1985) of finding the shortest tour of N given cities (although the elastic net tries to minimise 
the sum of squared distances rather than the sum of distances). The TSP is an NP-complete problem, with a 



search space of 



(N-l)\ 



different tours. Although the elastic net does find good tours in a reasonable time, it is not 



really competitive with the leading heuristic TSP methods (Potvin, 1993; Reinelt, 1994; Johnson and McGeoch, 
1997; Gutin and Punnen, 2002). These include n-opt, Lin and Kernighan (1973) and variations of them, and 
their basic idea is to replace n arcs locally in the current tour ( "perform a move" ) and iterate till a local optimum 
is found; usually n is 2 or 3. 

The GEN proposed here can solve a correspondingly generalised TSP (likewise an NP-complete problem, 
to which n-opt-type algorithms are applicable). We show here how an appropriate definition of the D matrix 
may be used for other TSP-type problems, in particular the multiple TSP. In the if -TSP (Lawler et al., 1985, 
pp. 169-170), we are looking for a collection of K subtours, each containing a fixed start city, such that each city 
is in at least one subtour. This problem models the situation where K salesmen work for a company with home 
office in the start city, and between them each city must be visited at least once. Lawler et al. (1985) propose 
the length of the maximum length subtour as quality measure of the collection of routes; and claim that the 
best algorithm known is a tour-splitting one, i.e., one that starts with a standard travelling salesman tour and 
proceeds to partition it into K pieces. 

The GEN framework defined in this work can be used for a if -TSP where the number of centroids Mk in each 
subtour k is given in advance. We simply partition the matrix D diagonally in K submatrices; submatrix k is 
Mk x Mk. For example, if there are K = 3 subtours with k\ = 3, = 5 and ks = 4 centroids, respectively, then: 



D 



o \ 



where we have used nonperiodic b.c. for all subnets. Clearly, we can create more complex combinations of topology 
(e.g. have each subnet have its own b.c. and dimension). As we pointed out in section 5.5.2, if the number of 
centroids M is odd the D matrix associated with the central- difference stencil \{— 1, 0, 1) is equivalent to a 
2-block D matrix (as can be seen by permuting the centroid indices) and results in two disjoint subnets (if M 
is even we get the usual elastic net but with permuted centroids). These two nets occupy disjoint regions of 
city space, which can be qualitatively justified as follows. Assume, for a random distribution of N cities over an 
area V (of any dimensionality D > 1), that the expected shortest tour length is approximately proportional to 
I = N p V q with p, q > 0, where the actual p and q depend on the distribution of the cities. For example, for 
uniformly random cities p = 1 — and q = (Beardwood et al., 1959; Lawler et al., 1985, p. 186ff). If we 
have K tours each on a disjoint region with the same area and number of cities then the total tour length is 
proportional to K (N / K) p (V / K) q = K l ~ p ~ q l, while if all tours interleave over the whole area we have a total 
tour length proportional to K(N/K) p V q = K l ~ p L Thus, it pays better to have disjoint tours, and if p + q > 1 
it pays better to have as many tours as possible. This is demonstrated in fig. 7A, where the two subtours are 
simply lines along the upper and lower rows of cities, and in fig. 8; and is consistent with the sawteeth linking 
faraway centroids. 

To get the subtours to share a fixed city, we can add a constant centroid yo and connect it to every subtour. 
In general, we can extend the model of section 1 by adding "bias terms" to the prior on the centroids Y, i.e., 
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taking Y = (Yo Yi) with Yo constant and Yi variable. The D matrix can connect the variable centroids with 
each other as usual and also any constant centroid in Yo to one or more variable centroids in Yi (connecting 
constant centroids with each other clearly has no effect on the optimisation). 

The strategy of separating the net into several uncoupled nets is equivalent to allowing discontinuities to occur 
unpenalised at specific points; see Terzopoulos (1986) for an analogous idea in a continuous setting. 

Finally note that, from a TSP perspective, the globally optimal solution of the ID OD problem of fig. 10 is 
a net with a □ form (i.e., one eye is traversed across VF, then the other eye). In all our simulations, no matter 
how slow the annealing the original elastic net always results in a wavy net (e.g. fig. 10), while the second-order 
forward-difference stencil can obtain the □ net through loop elimination as in fig. 12 (if annealing very slowly 
and using a high value of j3). 



Relation with the C measure of Goodhill and Sejnowski (1997) Goodhill and Sejnowski (1997) intro- 
duced a measure C of the degree of similarity preservation, or neighbourhood preservation, between two discrete 
spaces mapped by a bijection H. C was defined as the following cost functional: 

N 

c tgj2J2 F (i,j)Gm),m) as) 

i=l j<i 

where z, j G {1, . . . , N} label points in one space and F and G measure the similarity between two points in each 
of the spaces, respectively. If we take i and j in the cortical space, with H(i) = yi being the centroid in city 
space, and define the similarity functions as follows: 

= \ a ; d Jare n ^ hbou ^ om),m) = \\m - m\r P 

10, otherwise y 

where "neighbouring" means adjacent in the cortical space grid, then we get the minimal wiring definition 
(Mitchison and Durbin, 1986; Durbin and Mitchison, 1990). And for p = 2, minimising C is equivalent to solving 
the TSP with the original elastic net. Other choices of F and G coincide (approximately) with well-known 
measures of neighbourhood preservation. 

However, eq. (15) cannot accommodate a GEN tension term tr (Y T YS) or ||DY T || because C is defined by 
products of binary (pairwise) relations. It is possible to extend C to accommodate it by using multiple relations: 

N 

c = E E E • • • F ^ fc > • • • ) G ( ff «' H w> H ^ • • • ) 

i=l j<i k<j 



• • 7 x def jl, ij,k,... are neighbouring n (uc\ uc\ ua\ \ 
10, otherwise 

but there seems to be no particular benefit in doing this. 



linear combination of 
H(i),H(j),H(k),... 



What stencil to use? As in other work where differential penalties are used, there is not a general answer to 
this question — it depends on the particular application. In the discrete case the problem is compounded by the 
choice not only of differential operator but also of finite difference scheme. In this paper we have studied some 
properties of two families of stencils in the context of a density estimation problem (the elastic net, applied to 
cortical map modelling), and shown that the forward-difference family is closer to the continuous case. In general, 
we think that the choice of stencil should be given by its power spectrum, subject to the stencil being sparse for 
computational efficiency. 

The work done in continuous problems, such as nonparametric regression with roughness penalties (Green 
and Silverman, 1994; Wahba, 1990) suggests other conditions that the appropriate operator should satisfy. For 
example, in dimension 2 isotropy may be desired, which requires the penalty to be invariant to rotations — though 
note that this is not naturally defined in the discrete case. Also, the particular application may suggest what 
functions are natural solutions, and so this defines the nullspace of the operator; for example, if a periodic 
function is desired, then the operator + should be used (where uj is a frequency). In practice, most work 
has restricted itself to 2nd-order derivatives at most, and in fact most differential equations of physics are also 
2nd-order or less (with few exceptions, one being the 4th-order biharmonic equation satisfied by the deflection of 
a loaded beam in elasticity theory). 



41 



Higher-order derivatives may be necessary or advisable in higher dimensions, however. The continuous case 
provides a constraint on the derivative order given by the dimensionality. Consider a pth-order differential operator 
in the L-dimensional continuous case, such as the following rotationally and translationally invariant penalty: 

f sr^ P ] ( d p V V 

d!,...,d L >0 

Surprisingly, this operator requires p > L/2, as the following example from Green and Silverman (1994) suggests 
(the argument can be made rigorously in terms of Beppo-Levy and Sobolev spaces). Consider the "test function" 
# s (u) = exp(— ||u/«s|| 2 ). Then, by a change of variables u — » v/s, we obtain R(gi) = s 2p ~ L R(g s ). For R to 
measure indeed roughness, and since a delta function is maximally rough, R(g s ) must increase as s — >• and so 
we must have 2p — L > 0. In other words, we have to make sure that the squared magnitude of the derivative 
term (s 2p ) dominates the loss in volume of the differential element du\ . . . duL (s L ). 

In the discrete case it is difficult to determine a concrete condition (if it exists at all) between the derivative 
order and the dimensionality of the net, partly because it depends on the stencil chosen to represent the derivative 
and on the test function, and partly because the finite size of both the stencil and the test function make the 
calculation complicated. Consider the forward- difference family. The tension of the delta net equals the stencil 
squared modulus (section 5.2) and grows as 2 P , so that the higher the derivative the higher the roughness value. 
But there is a dimensionality effect on the tension term as a measure of roughness. For example, consider a 
discrete version of the test function that is a triangular bump of height 1 and width W across all dimensions 
(W = 1 recovers the delta net). Then it is easy to see that for p = 1 the tension term is (9(W L_2 ), so that in 
2D the tension is the same for the delta net as for a broad bump, and in L > 3 the delta net has lower tension. 
For p = 2 we obtain 0(W L ~ 3 ). This might suggest the use of somewhat higher-order derivatives as the net 
dimensionality grows. However, this issue is less important than in the continuous case because the net resolution 
is lower bounded and so is the tension term value. Besides, in practice one is anyway limited to very small L 
(since the number of centroids M grows exponentially with L) and relatively small p (because the size of the 
stencil grows with p and should be much smaller than M). Finally, the use of a lst-order forward difference in 2D 
for cortical map models (Durbin and Mitchison, 1990; Goodhill and Willshaw, 1990; section 6) does not result in 
rough nets in comparison with higher orders. 

8 Conclusions 

We have studied algorithmically, theoretically and by simulation a generalisation of the elastic net and applied it 
to cortical map modelling. The original motivation of this work was to understand the role of the tension term, 
which is an abstract form of lateral interactions, in the development of visual cortical maps. 

We have defined the generalised elastic net as a probabilistic model given by a Gaussian mixture (fitness term) 
whose centroids are subject to a Gaussian prior (tension term). We have given several deterministic annealing 
algorithms for learning the centroids of the net and argued that the one based on the sparse Cholesky factorisation 
is the most effective for the type of tension terms we favour. These result from the convolution of the net with a 
discrete, sparse stencil that approximates a differential operator in the Taylor sense. 

We have given a theoretical analysis of this type of tension terms in the case of periodic boundary conditions in 
ID. By analogy with the continuous case, the properties of the stencil are apparent in the Fourier power spectrum, 
or equivalently in the eigenvalues of its circulant matrix. However, in the discrete case the space of stencils is 
very large, since there are many different stencils that correspond to the same derivative order. We have studied 
two families obtained by repeated application of a given first-order stencil, forward and central differences, and 
suggested other ways to design stencils. The forward- difference family is the analogue of the continuous derivative, 
with stencils behaving as high-pass filters. The central-difference family results in band-pass filters which produce 
sawtooth patterns in the resulting nets — even as the number of centroids tends to infinity, in stark contrast with 
the continuous case — and can be seen as competing uncoupled nets. The discrete case is then qualitatively richer 
in the sense that it can present behaviours that do not have a correlate in the continuous case. 

The original elastic net corresponds to the first-order forward-difference stencil. As a corollary of the analysis 
we have shown that the tension term in this case can be exactly rewritten in a different way to the sum-of-square- 
distances form. This alternative form has a Mexican-hat shape and results from evenising the stencil subject 
to having the same power spectrum. Higher-order stencils add more oscillations to this Mexican hat. In the 
context of previous elastic net work, it demonstrates the equivalence of squared wire length (in stimulus space) 
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and Mexican-hat lateral interaction — and so that what seemingly looks like a sparse pattern of nearest-neighbour 
connections can be implemented with a denser net of connections (whose strength decays rapidly with distance). 
This brings the elastic net closer to most other cortical map models (e.g. to that of Miller et al., 1989), in which 
a dense pattern of connections with a Mexican-hat form is a crucial component. 

Finally, we have studied the cortical maps that arise with the forward-difference family. Our Cholesky-based 
deterministic annealing algorithm allows to explore a wide range of tension terms and so provide an empirical 
phase diagram for the maps. Such maps result from the interaction of the fitness and tension terms and we have 
given a qualitative explanation of the striped maps based on a drifting cutoff argument and the stencil spectrum. 
This argument predicts that the resulting nets show a periodic structure where the frequency increases with the 
derivative order and decreases with the strength of the tension term. The power spectrum of the stencil explains 
the anisotropy of the maps. A full theoretical explanation of map development with arbitrary derivatives, in 
particular the geometric interrelations between maps in 2D nets, is a subject for future research. 
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A Normalisation constant of the prior density 

The Gaussian prior on the centroids is proper if S is positive definite and improper if S is positive semidefinite 
(otherwise, S has a negative eigenvalue and so the prior cannot be normalised since it diverges exponentially 
along some directions). The full expression of the prior is: 

P (Y;/?)=(A)%ife-^-) (16) 

where r = f rank (S) and |S| + is the product of the r positive eigenvalues of S. If S is positive definite then r = M 
and |S| + is just the determinant of S. If S is positive semidefinite, and so has one or more zero eigenvalues, 
then the covariance 5] = (/3S) _1 is finite along r eigenvectors and infinite along the rest, so the density is taken 
as 1 along those directions. Specifically, calling y a row of Y and spectrally decomposing T, = UAU T with U 
orthogonal, A = ^ 2 ) with diagonal blocks and A^" 1 = 0, and defining z = U T y: 

p(y) oc exp (-i y ^S-V) = exp (-l(U T yf A" 1 ^)) = exp ^-zjA^ exp (-^A^) (17) 

where the term in A2 is 1. Thus, p(Y;/3) is given by eq. (16) along directions of imS, with dimension r; and 
p(Y; j3) = 1 improper along directions of ker S, with dimension M — r. 

B Stencil normalisation to unit power 

In order to compare different stencils we need to normalise them, as discussed in section 5.4. Since the eigenvectors 
of the stencil matrix are plane waves, we will normalise in the Fourier domain. By convention, we will assume 
that all stencils have unit power, i.e., that the integral of the power spectrum is one. This means that if a 
stencil favours high frequencies then it must compensate by disfavouring low ones. This is the same as forcing 
an impulse (delta function) to have unit power in Fourier space after being filtered by the stencil. Therefore we 
need to calculate the power of a given stencil. 

Since the domain is discrete, it would be possible to consider the sum of the powers at each frequency rather 
than the integral, but by considering the stencil as a sum of delta functions, and thus defined over the real space, 
we can introduce the discretisation level of the net and so compare nets with different resolutions (e.g. the same 
piece of cortex at different resolutions). Besides, it turns out that both approaches — the discrete sum and the 
continuous integral — give exactly the same result for the power. 

First we consider the continuous approach, i.e., we consider the stencil as a continuous function that is nonzero 
at the nodes of the grid only. Consider then an L-dimensional net with Mi x M2 x • • • x Ml centroids that are a 
regular sample of an L-dimensional hyperrectangle of lengths W\ , . . . , Wl with Wi = hM\ and h G M + the step 
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size. That is, we consider square voxels of size h L (in net space) for simplicity. The stencil is defined over R L by 
the delta-mixture function 

def 

q(u) = 2^ c m £( u - u m ) u m = Jim = — m 

m 1 

where we use boldface for real vectors and vectors of indices, e.g. m = (mi, . . . , ttil) t - The origin of the coordinate 
system is irrelevant. The continuous Fourier transform of c is then the plane- wave superposition 

r m i M-i k w 

J ^ L m=0 m=0 

where u is a length and k an inverse length (in L dimensions). Both <^(u) and <f(k) are defined in R L and, while 
^(u) = except at each u m , <f(k) ^ in general. If we define ki = kiWi at the integer values 0, . . . , Mi — 1, then 
{U^To 1 is the DFT of {<; m }^. 

The continuous power spectrum, defined in R L , is 



p(k) d = f i<?(k)i 2 



(*) 



M-l 


2 M-l 






m=0 


m,n=0 


M-l 


M-l 


^2 ^m^nCOs27Tk T (u m - U n ) = ^ ^ m + ^ ^m^n COS 27rk T (u m 


m . n — 


m=0 m/n 



(18) 



where step (*) holds because P(k) is real. The total power spectrum over R L diverges, but since £(k) is periodic, 
we can concentrate on the average power in a single period: 



M 1 M L Mi M L 



Mi M L 
W 1 f W L 



p d = f [ Wl ...[ WL P<Mdk/ [ Wl ...[ WL dk. 

Jo Jo Jo Jo 

To compute P, note that the cosine term in (18) cancels: changing variables we get 

cos27rk T (u m — u n ) dk = constant x / . . . / cos (mi — ni)xi = 

Jo Jo \ l=1 J 

by using cos (a + /?) = cos a cos j3 — sin a sin f3 and factoring out variables, and recalling that mi — ni G Z. Thus, 
the expression for the average power in a period, valid for any stencil and any boundary condition, is simply 
P ~ Xlm=o i- e -5 the stencil squared modulus (equal to tr (S) for the circulant case). When we have several 
stencils qj (section 5.7.2), since the power is additive we get 

M-l 

j j m=0 

As mentioned above, the same result for P is obtained by the discrete approach, i.e., approximating the integral 
by a sum over a grid of voxel Ak = k^ + i — k^ = (^-, • • • , ~wz) T m Fourier space we obtain 



M-l M-l 

vol ( Ak) 



Ill 

=0 m=0 



again by Parseval's theorem. 

Finally, we must take into account the step size for the finite difference (see eq. (9)) and use q m h~ p in place 
of ^ m . For example, for eq. (9) with the second-order forward difference we have 7^2 (1, —2, 1). Thus we obtain 
the final expression for the average power of the stencil q: 



M-l 



m 
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which provides the link with a continuous net with physical dimensions (e.g. a piece of cortex). However, in this 
paper we assume h = 1 and use a fixed resolution, since we are interested in comparing different values of f3 and 
the order p over the same net (i.e., at the same discretisation level). 

In summary, the normalisation we use amounts to dividing the stencil by its squared modulus, and this is 
what we do in figures such as 3 and 14. For the forward- difference family in ID (section 5.5.1), the stencil squared 
modulus is ( 2 J), which grows exponentially as 2 2p / v /7fp. For the central-difference family (section 5.5.2), it is 

w( 2 /)' wn i cn decreases as \j sjwp. 



C Fourier transforms 

We use the following formulation for the continuous and discrete Fourier transforms, respectively: 

/oo roo 
y(u)e- i27rku du y(u) = y(k)e i27rku dk u, k € (-oo, oo) 
-CO J — OO 



-OO 

M-l , M-l 



'k\ 2 



Discrete: y k d = f ^ j/ m e" <27r * m y m = ^ ^ Vk^ m ™, k = 0, . . . , M - 1. 

m=0 k=0 

We use several well-known properties of the Fourier transform, such as Parseval's theorem: 

,oo noo M-l M-l 

Continuous: / \y(u)\ 2 du = \y(k)\ 2 dk Discrete: ^ \ym\ 2 = V7 ^ \y, 

m=0 fc=0 

or the derivative theorem (continuous case only): 

y^\u) 4 (i2irk) p y(k). 

More details can be found in standard texts (Bracewell, 2000; Papoulis, 1962). Note that the mathematical 
expressions may differ by a constant factor depending on the Fourier transform formulation used. 



D Proofs 

In the propositions of section 5.6 (and occasionally elsewhere) we encounter some series and integrals whose 
value can be found in mathematical handbooks, in particular Prudnikov et al. (1986) (abbreviated PBM) and 
Gradshteyn and Ryzhik (1994) (abbreviated GR). 

Proof of proposition 5.1. By substitution. For a = 0, . . . , M — 1 and m = 0, . . . , M — 1: 

M-l M-l a-1 M-l 

(Df m ) a = ^2 dafifmP = ^2 ^(/3-c0 mod M^m = + ^ 

(3=0 (3=0 (3=0 (3=a 

M-l M-a-1 /M-l \ 

= E d ^l +a ~ M + E « +a = E d 7< < = A m / m « 

7=M-a 7=0 V 7=0 / 

where we have changed 7 = /? — a + M and 7 = /3 — a in the first and second summations, respectively. 

It is instructive to see that if a Toeplitz matrix A has plane- wave eigenvectors then it must be circulant. 
Assuming a nm = a m _ n and trying a plane- wave eigenvector f = (fp) with fp = e lk ^: 

M-l M-l M-l M-a-1 

(Af) a = E = E «/5-« eifc/J = eika E ^-« etttf " a) = A «/« fOT A « = E 

(3=0 (3=0 (3=0 7=-« 

Now A a must be independent of a, so for all a = 0, . . . , M — 2 we must have 



M-a-1 M-a-2 

ik^ (*) ,„ x , LVJ , . . . 

dM-a-ie a- a -ie 

7= — a 7= — ol — 1 



777/ 

=> = 2tt— for 777 = 0, . . . , M - 1 and a M -a-i = a~a-i a a p = a (/3 _ a) mod M 
where in step (*) the inner summands cancel. □ 



45 



Proof of proposition 5. 6. It is straightforward to see that D T is circulant with eigenvectors f m associated with 
eigenvalues A^. Hence: 

Sf m = D T Df m = A m D T f m = A m A m f m = |A m | f m 

and likewise Sf^ = |A m | 2 f^. So f m and are eigenvectors of S associated with the real eigenvalue v m = 
|A m | 2 > 0. Thus, v m = |(f m + f^) = 5R(f m ), with v mn = cos (27r-^n), and w m = ^(f m - f^) = 3?(f m ), with 
Wmn = sin (27r^n), are eigenvectors associated with v m . 

In terms of the first row of S, (so>si> • • • ,%-i), the eigenvalues are (noting that «s n = sm-u because S is 
symmetric), for m = 0, . . . , M — 1: 

M-l M-l 1 M-l M-l 

2^m = (^n + %-n)< = ^ + ^ ^ = ^ + CJ~ n ) = 2 ^ 5 n COS fa^nj 

n=0 n=0 Z=M n=0 n=0 

where we have changed I — M — n and so = 5m- The first row of S can be seen to be as follows in terms of that 
of D: 

M-l M-l m-l M-l 

5'0rn = 5 m = ^ diodim = ^ G?zd(/_ m ) moc l M = ^ djdl-rn+M + ^ didj- rn . (19) 
Z=0 Z=0 £=0 Z=ra 

□ 

Proof of proposition 5.13. The DFT <f of the left-aligned stencil (i.e., padded with zeroes to the right) c = 
(^0,^1, • • • ,Cm-i) is 

M-l M-l 

» = E ^ e ~ i2 ^ m = E ^ m fc = o, i, . . . , m — i. 

Therefore q~ times an offset phase factor e* 27r ^ m ' (for some integer m') is equal to A£ of the D matrix and 
the power spectrum is |^| 2 = |Afc| 2 = of the matrix S. Note that the power spectrum is invariant to shifts 
n — » m + m', which corresponds to the invariance of S to row permutations in D. □ 

Proof of proposition 5. 14- By Parseval's theorem we have P = E^Lo* \$k f = M E m and since the eigenvalues 
of S are = |Qc| 2 , we have tr (S) = Efclo 1 u k = ^"E m The proposition can also be proven by noting that 

for n = 0, . . . , M - 1, S nn = S = Erf 4o = Em=0 and SO tr ( S ) = Ent^ *nn = M E m 4' D 

Proof of proposition 5.17. The sawtooth frequency is m = ^ and so wm = e l7T = — 1. We have that its associated 
eigenvalue in the D matrix is A m = En^C) 1 d n (—l) n and its power in the Fourier domain is \\m. | 2 . Since (do, • • • ) 
is the stencil padded with zeroes, Am is zero when E n even $ n = E n odd $ n . Since the stencil is differential it must 

Satisfy En ^ = 0, SO En even = E n odd ?n = 0. 

The same result is obtained in the net domain by forcing <; */ to be identically zero, where / = (1,-1,1,-1,. ..,1,-1) 
is the sawtooth wave. □ 

Proof of proposition 5.18. Analogous to the ID case. The 2D DFT of the stencil c = (c mn ) is 

M-l iV-1 

& d = f E E We- i2 -(^ m+ ^") 

ra=0 n=0 

which gives 

M-l AT- 1 M— 1 N—l 

%,f = E E we-^ = E E W(-1)^. 

m=0 n=0 m=0 n=0 

Since c is differential, it also verifies E!JoEn=o^n = 0, hence ^ = if and only if E m +n even^ = 

X^m+n odd ^rnn =0. □ 

Proof of proposition 5.21. Obviously V* V = V* V (note the respective circulant matrices commute too). Now, 
from eq. (9) with a p = 1: 
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where G{h q ) = f( q+q )(£)a g+g /ft g is also a function of t since £ depends on t. Then: 

_ ((V*V) */)(*) _ ^ 



ftp 



ftp+4 



— (fW(t) + 0(h*))+0(h^ 



f(p+Q)(t) + (9(ft g/ ) + e>(//) = + o(ft min(p/>g/) ). 



□ 



Proof of proposition 5.23. By definition of matrix associated with a stencil, Dy = ^ *y and Ey = g *y for any 
y G M M . Thus DEy = V * (V * y) = (V * V) * y- □ 

Proof of proposition 5.25. Call = V * V- Then 1i7 m = ^n^m-n and 

5^^(-ir = EVnE^-(- 1 ) m = E Vn(-i)- n ^ tn(-ir = o. 



□ 

Proof of proposition 5.26. By induction. For p = 0, necessarily m = and so (°^o = 1 (the delta function). 
Assume that the statement holds for ^ p \. Then: 



(p+i) 



m = : 
m = 1, . 

m = p : 



(-1)^ 

,P - 1 : + W^-i = -(-l) m+P D + (-l) m - 1+P L P _i) 

( _ lr+ p + l + (m p J) = ( _ lr+ p + l ( p+l) 



□ 



Proo/ of proposition 5.27. Y,l=l iP ^l = ELo = (?) ( GR °- 157:1 )' Since tr ( b)s ) = "n, this also 



proves the formula ^f^ 1 (2sin {ir^)f P = M( 2 /). 



□ 



Proof of proposition 5.28. By induction. For p = 0, necessarily m = and so (°\o = 1 (the delta function). 
Assume that the statement holds for ( p \. Then: 



' m = : 
m = 1 : 

ra = 2,...,2p: 

m = 2p + 1 : 
,ra = 2p + 2 : 



2p+ 



iWft = o 



.i(p) c +i(p) c 



1 

2 p+i • 



m = 2n + 1 odd: 

m = 2n even: -i(-l)P+^)^L + I (-1)^-1 ^ 

(-iri p r)^ 



□ 



Proof of proposition 5.30. Since S is real circulant, VM-k — v^. 

(=>) Since z/ m = VM-m = the eigenvalues are real. Diagonalising S = -^FNF*, then S T = = (S*) T 
S. 

(<=) Since S is symmetric, its eigenvalues are real, so v% = v k . Thus v k = i^M-k- D 



jgFNF 



Proof of proposition 5.34- Consider the integral / = f J* f(x)dx where / is a continuous function in [0,7r]. We 
can construct a Riemann sum that converges to the integral (Apostol, 1957) by dividing the interval [0, tt] into 
M bins of width Ax = with x k = kAx: 



M-l 



M-l 



1 = J im E f( x k) Ax = j im j7 E ffa)' 



k=0 



M^oo M 



k=0 
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Thus, for ^d m as given by eq. (12) we have 

= — I(p,m) with I(p, m) = / sin p x cos 2mx dx m = 0, 1, 2, . . . , oo. 
n Jo 

Gradshteyn and Ryzhik (1994) give 7 : 



GR 3.631:12: sin 2n x cos 2mx dx 
GR 3.631:13: sin 2n+1 x cos 2mx ^ = < 



(-ir2-^( n -J 7 r, n> 



m 



0, n < m 

( (-l) m 2 n+1 n!(2n + l)!! 
(2n-2ra + l)!!(2ra + 2n + l)!!' 
(_l)n+i 2 n+i n! ( 2n - 2m - 3)!! (2n + 1)!! 



n > m — 1 
n < m — 1. 



(2ra + 2n + l)!! 

We can simplify GR 3.631:13 (using induction on n for the case n > m — 1) to give 

v (-l) n+1 2 n+1 n!(2n + l)!! 
J(p,m) = n n =o(4m2 _ (2fc + 1) 2) - = 0,1,2,... 

The statement now follows trivially. □ 
Proof of proposition 5.35. By summing the following geometric series for n = 0, . . . , M — 1: 



m=0 



'M, n = 
0, n > even 
2 --n-, n odd 



and taking its imaginary part we obtain 



m^i m (O, neven 

2^ Sin (TT-n = i sin(^) _ l+cosfrft) _ f x 

m=0 [l-cos(^)- sin(^) - CO H^2mJ' ™ OCW - 

The statement follows by applying this to 

M_1 / k \ ( m \ M ~ 1 1 / / k \ ( k 

Esin I 7r— - I cos ( 2tt— -k ) = > - I sin I n—-(2m + 1) I — sin I 7r — -(2m — 1) 
V My V M J ^ 2\ \ M V V M J 

k=0 v 7 k=0 v v 7 v 

and simplifying. □ 
Proof of proposition 5.36. Trivial by induction on n, noting that 

(2n+l) 7oo = ^ V (2n-l) 7oo 

m 4m 2 - (2n + l) 2 

(for some if > from proposition 5.34) and so ^ 2n+1 ^<i^ has the same sign as ^ 2n ~ 1 ^d < ^ for m < n and the 
opposite sign for m > n. □ 

Proof of proposition 5.37. Analogously to the proof of proposition 5.34, for ^d m as given by eq. (14) we have 

1 f 7 * 
^d^ = —I(p,m) with I(p, m) = / |sin p 2x\ cos 2mx dx m = 0, 1, 2, . . . , oo. 
n Jo 

For odd m, the integral vanishes because the integrand is an odd function of x around x = ^. For m even, the 
integrand is even, and changing variables x to | the integral becomes JJ 1 " sin p cos m?/ which can be calculated 
with GR 3.631:12-13 as before. " 1 □ 



7 Prudnikov et al. (1986) give a correct value for the first integral (PBM1 2.5.12:38-39) but an incorrect one for the second 
(PBM1 2.5.12:13-16,37). 
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